|
Remove the now unused eMAG MIDR check and unused entries from cpu_list[].
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
As a cleanup remove the eMAG ifunc for memset.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
As a cleanup remove the eMAG ifunc for memchr.
Reviewed-by: JiangNing OS <jiangning@amperemail.onmicrosoft.com>
|
|
MSG_EXAMINE has been broadened to allow the signal thread (for
example) to access additional arguments that are passed to
interruptible RPCs in other threads. All architecture specific
variants of intr-msg.h now comply with the revised interface, and the
single user of MSG_EXAMINE (report-wait.c) has been adjusted accordingly.
Message-ID: <20260401194948.90428-2-mike@weatherwax.co.uk>
|
|
The CORE-MATH commit e756933f improved the error bound in the fast path
for x_0 <= x < 1/4, along with a formal proof [1].
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and arm-linux-gnueabihf.
[1] https://core-math.gitlabpages.inria.fr/sinh.pdf
|
|
Introduced a synthetic architecture preference flag (Prefer_EVEX512)
and enabled it for AMD Zen5 (CPUID Family 0x1A) when AVX-512 is supported.
This flag modifies IFUNC dispatch to prefer 512-bit EVEX variants over
256-bit EVEX variants for string and memory functions on Zen5 processors,
leveraging their native 512-bit execution units for improved throughput.
When Prefer_EVEX512 is set, the dispatcher selects evex512 implementations;
otherwise, it falls back to evex (256-bit) variants.
The implementation updates the IFUNC selection logic in ifunc-avx2.h and
ifunc-evex.h to check for the Prefer_EVEX512 flag before dispatching to
EVEX512 implementations. This change affects six string/memory functions:
- strchr
- strlen
- strnlen
- strrchr
- strchrnul
- memchr
Benchmarks conducted on AMD Zen5 hardware demonstrate significant
performance improvements across all affected functions:
Function     Baseline   Patched    Avg        Avg       Avg      Max
             Variant    Variant    Baseline   Patched   Change   Improve
                                   (ns)       (ns)      %        %
------------+----------+----------+----------+---------+--------+--------
STRCHR       evex       evex512    16.408     12.293    25.08%   37.69%
STRLEN       evex       evex512    16.862     11.436    32.18%   56.74%
STRNLEN      evex       evex512    18.493     11.762    36.40%   64.40%
STRRCHR      evex       evex512    15.154     10.874    28.24%   44.38%
STRCHRNUL    evex       evex512    16.464     12.605    23.44%   45.56%
MEMCHR       evex       evex512     9.984      8.268    17.19%   39.99%
Additionally, a tunable option (glibc.cpu.x86_cpu_features.preferred)
is provided to allow runtime control of the Prefer_EVEX512 flag for testing
and compatibility.
Reviewed-by: Ganesh Gopalasubramanian <Ganesh.Gopalasubramanian@amd.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
The new test from 19781c2221 triggers a failure on i686:
testing float (without inline functions)
Failure: lgamma (0x3.12be38p+120): errno set to 0, expected 34 (ERANGE)
Failure: lgamma_upward (0x3.12be38p+120): errno set to 0, expected 34 (ERANGE)
Use math_narrow_eval on the multiplication to force the expected
precision.
Checked on i686-linux-gnu.
|
|
This is similar to the original CORE-MATH code, and is why the function
exists.
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and arm-linux-gnueabihf.
|
|
Add implies, abilist, c++-types and syscall files.
|
|
|
|
|
|
Move the loongarch64 implementation to sysdeps/loongarch/lp64/fpu.
|
|
|
|
The libc_feupdateenv_test macro is supposed to trap when the trap for a
previously held exception is enabled. But
libc_feupdateenv_test_loongarch wasn't doing it properly: the comment
claims "setting of the cause bits" would cause "the hardware to generate
the exception" but that's simply not true for the LoongArch movgr2fcsr
instruction.
To fix the issue, we need to call __feraiseexcept in case a held exception
is enabled to trap.
Reviewed-by: caiyinyu <caiyinyu@loongson.cn>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
|
|
The comment explaining the reason to clear CAUSE does not make any
sense: it says the next "CTC" instruction would raise the FP exception
of which both the CAUSE and ENABLE bits are set, but LoongArch does not
have the CTC instruction. LoongArch has the movgr2fcsr instruction but
movgr2fcsr never raises any FP exception, different from the MIPS CTC
instruction.
So we don't really need to care about CAUSE at all.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
|
|
This patch from Adhemerval sets up the ifunc redirections so that we
resolve memcpy to memcpy_generic in early startup. This avoids infinite
recursion for memcpy calls before the loader is fully initialized.
Tested-by: Jeff Law <jeffrey.law@oss.qualcomm.com>
|
|
Detect clang explicitly and apply compiler-specific version checks for
RVV support.
Signed-off-by: Zihong Yao <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
|
|
It syncs with CORE-MATH 9a75500ba1831 and 20d51f2ee.
Checked on aarch64-linux-gnu.
|
|
|
|
It removes some unnecessary corner-case checks and uses a slightly
different binary algorithm for the hard-case database binary search.
Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.
|
|
It adds a minor optimization on the fast path.
Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.
|
|
The libgcc implementations of __builtin_clzl/__builtin_ctzl may require
access to additional data that is not marked as hidden, which could
introduce additional GOT indirection and necessitate RELATIVE relocs.
And the RELATIVE reloc is an issue if the code is used during static-pie
startup before self-relocation (for instance, during an assert).
For this case, the ABI can add a string-bitops.h header that defines
HAVE_BITOPTS_WORKING to 0. A configure check for this issue is tricky
because it requires linking against the standard libraries, which
create many RELATIVE relocations and complicate filtering those that
might be created by the builtins.
The fallback is disabled by default, so no target is affected.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|
|
Remove the prefer_sve_ifuncs CPU feature since it was intended for older
kernels. Current distros all use modern Linux kernels with improved support
for SVE save/restore, making this check redundant.
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
|
|
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
|
|
The inclusion of the generic tanh implementation without undefining
libm_alias_double (to provide the __tanh_sse2 implementation) made the
exported tanh symbol point to the SSE2 variant.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
cosh shows an improvement of about 35% when building for
x86_64-v3.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
Common data definitions are moved to e_coshsinh_data, cosh only
data is moved to e_cosh_data, sinh to e_sinh_data, and tanh to
e_tanh_data.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
The current implementation shows the following accuracy, on
three ranges ([-DBL_MAX,-10], [-10,10], [10,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):
* Range [-DBL_MAX, -10]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
* Range [-10, 10]
* FE_TONEAREST
0: 4059325526 94.51%
1: 231023238 5.38%
2: 4618531 0.11%
* FE_UPWARD
0: 2106654900 49.05%
1: 2145413180 49.95%
2: 40847554 0.95%
3: 2051661 0.05%
* FE_DOWNWARD
0: 2106618401 49.05%
1: 2145409958 49.95%
2: 40880992 0.95%
3: 2057944 0.05%
* FE_TOWARDZERO
0: 4061659952 94.57%
1: 221006985 5.15%
2: 12285512 0.29%
3: 14846 0.00%
* Range [10, DBL_MAX]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Performance-wise, it shows:
latency master patched improvement
x86_64 109.7420 184.5950 -68.21%
x86_64v2 109.1230 187.1890 -71.54%
x86_64v3 99.4471 49.1104 50.62%
aarch64 43.0474 32.2933 24.98%
armhf-vfpv4 41.0954 35.8473 12.77%
powerpc64le 27.3282 22.7134 16.89%
reciprocal-throughput master patched improvement
x86_64 42.5562 158.1820 -271.70%
x86_64v2 42.5734 159.2560 -274.07%
x86_64v3 35.9899 24.2877 32.52%
aarch64 24.7660 22.8466 7.75%
armhf-vfpv4 27.0251 25.8150 4.48%
powerpc64le 11.7350 11.2504 4.13%
* x86_64: gcc version 15.2.1 20260112, Ryzen 9 5900X, --disable-multi-arch
* aarch64: gcc version 15.2.1 20251105, Neoverse-N1
* armv7a-vfpv4: gcc version 15.2.1 20251105, Neoverse-N1
* powerpc64le: gcc version 15.2.1 20260128, POWER10
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
It improves throughput by 8% to 18% and latency by 1% to 10%,
depending on the ABI.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
The current implementation shows the following accuracy, on
three ranges ([-DBL_MAX,-10], [-10,10], [10,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):
* Range [-DBL_MAX, -10]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
* Range [-10, 10]
* FE_TONEAREST
0: 3169388892 73.79%
1: 1125270674 26.20%
2: 307729 0.01%
* FE_UPWARD
0: 1450068660 33.76%
1: 2146926394 49.99%
2: 697404986 16.24%
3: 567255 0.01%
* FE_DOWNWARD
0: 1449727976 33.75%
1: 2146957381 49.99%
2: 697719649 16.25%
3: 562289 0.01%
* FE_TOWARDZERO
0: 2519351889 58.66%
1: 1773434502 41.29%
2: 2180904 0.05%
* Range [10, DBL_MAX]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Performance-wise, it shows:
latency master patched improvement
x86_64 101.0710 129.4710 -28.10%
x86_64v2 101.1810 127.6370 -26.15%
x86_64v3 96.0685 48.5911 49.42%
aarch64 41.4229 22.3971 45.93%
armhf-vfpv4 42.8620 25.6011 40.27%
powerpc64le 29.2630 13.1450 55.08%
reciprocal-throughput master patched improvement
x86_64 42.6895 105.7150 -147.64%
x86_64v2 42.7255 104.7480 -145.17%
x86_64v3 39.6949 25.9087 34.73%
aarch64 26.0104 19.2236 26.09%
armhf-vfpv4 29.4362 23.6350 19.71%
powerpc64le 12.9170 8.34582 35.39%
* x86_64: gcc version 15.2.1 20260112, Ryzen 9 5900X, --disable-multi-arch
* aarch64: gcc version 15.2.1 20251105, Neoverse-N1
* armv7a-vfpv4: gcc version 15.2.1 20251105, Neoverse-N1
* powerpc64le: gcc version 15.2.1 20260128, POWER10
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
It improves throughput from 3.5% to 9%.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
The current implementation shows the following accuracy, on
three ranges ([-DBL_MAX,-10], [-10,10], [10,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):
* Range [-DBL_MAX, -10]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
* Range [-10, 10]
* FE_TONEAREST
0: 3291614060 76.64%
1: 1003353235 23.36%
* FE_UPWARD
0: 2295272497 53.44%
1: 1999675198 46.56%
2: 19600 0.00%
* FE_DOWNWARD
0: 2294966533 53.43%
1: 1999981461 46.57%
2: 19301 0.00%
* FE_TOWARDZERO
0: 2306015780 53.69%
1: 1988942093 46.31%
2: 9422 0.00%
* Range [10, DBL_MAX]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Performance-wise, it shows:
latency master patched improvement
x86_64 52.1066 126.4120 -142.60%
x86_64v2 49.5781 119.8520 -141.74%
x86_64v3 45.0811 50.5758 -12.19%
aarch64 19.9977 21.7814 -8.92%
armhf-vfpv4 20.5969 27.0479 -31.32%
powerpc64le 12.6405 13.6768 -8.20%
reciprocal-throughput master patched improvement
x86_64 18.4833 102.9120 -456.78%
x86_64v2 17.5409 99.5179 -467.35%
x86_64v3 18.9187 25.3662 -34.08%
aarch64 10.9045 18.8217 -72.60%
armhf-vfpv4 15.7430 24.0822 -52.97%
powerpc64le 5.4275 8.1269 -49.73%
* x86_64: gcc version 15.2.1 20260112, Ryzen 9 5900X, --disable-multi-arch
* aarch64: gcc version 15.2.1 20251105, Neoverse-N1
* armv7a-vfpv4: gcc version 15.2.1 20251105, Neoverse-N1
* powerpc64le: gcc version 15.2.1 20260128, POWER10
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
|
|
The last uses of PTHREAD_IN_LIBC are in places where it should have been
__PTHREAD_NPTL/HTL. The latter were not conveniently available everywhere;
defining them from config.h makes things simpler.
|
|
nptl is now always in libc.
|
|
htl can now have it directly in ld.so
|
|
|
|
|
|
Like nptl does, so we really get rwlock behavior.
|
|
We cannot use pthread_rwlock for these until we have reimplemented
pthread_rwlock with gsync, so fork __libc_rwlock off for now.
|
|
The Linux implementation of __check_pf retrieves interface data via
make_request, which queries the kernel via netlink. The IFA_ADDRESS
received from the kernel's RTM_NEWADDR netlink message is (a)
type-punned via pointer-casting leading to strict aliasing violations,
and (b) dereferenced assuming that it is non-NULL.
This commit removes the strict-aliasing violations using memcpy, and
adds an assert that the address is indeed non-NULL before dereferencing
it.
Reported-by: Siteshwar Vashisht <svashisht@redhat.com>
Reviewed-by: Sam James <sam@gentoo.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
GCC warns about this with -Wshift-negative-value:
In file included from ../sysdeps/x86/cpu-features.c:24:
../sysdeps/x86/dl-cacheinfo.h: In function ‘get_common_cache_info’:
../sysdeps/x86/dl-cacheinfo.h:913:45: warning: left shift of negative value [-Wshift-negative-value]
913 | count_mask = ~(-1 << (count_mask + 1));
| ^~
../sysdeps/x86/dl-cacheinfo.h:930:45: warning: left shift of negative value [-Wshift-negative-value]
930 | count_mask = ~(-1 << (count_mask + 1));
| ^~
This is because C23 § 6.5.8 specifies that this is undefined behavior.
We can cast it to unsigned, which would be equivalent to UINT_MAX.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
The 'unwind-link' facility allows glibc to support thread cancellation
and exit (pthread_cancel, pthread_exit, backtrace) by dynamically
loading the unwind library at runtime, preventing a hard dependency on
libgcc_s within libc.so.
When building with libunwind (for clang/LLVM toolchains [1]), two
assumptions in the existing code break:
1. The runtime library is libunwind.so instead of libgcc_s.so.
2. libgcc relies on __gcc_personality_v0 to handle unwinding mechanics.
libunwind exposes the standard '_Unwind_*' accessors directly.
This patch adapts `unwind-link` to handle both environments based on
the HAVE_CC_WITH_LIBUNWIND configuration:
* The UNWIND_SONAME macro now selects between LIBGCC_S_SO and
LIBUNWIND_SO.
* For libgcc, it continues to resolve `__gcc_personality_v0`.
* For libunwind, it instead resolves the standard
_Unwind_GetLanguageSpecificData, _Unwind_SetGR, _Unwind_SetIP,
and _Unwind_GetRegionStart helpers.
* unwind-resume.c is updated to implement wrappers for these
accessors that forward calls to the dynamically loaded function
pointers, effectively shimming the unwinder.
Tests and Makefiles are updated to link against `$(libunwind)` where
appropriate.
Reviewed-by: Sam James <sam@gentoo.org>
[1] https://github.com/libunwind/libunwind
|
|
In LoongArch, fcsr1 is an alias of the enables field of fcsr0, and fcsr3
is an alias of the RM field of fcsr0. This patch uses the fcsr1 and fcsr3
registers to optimize the fedisableexcept, feenableexcept, fegetexcept,
fegetround, fesetround, and get_rounding_mode functions, removing an
extra andi instruction.
|
|
The LLVM compiler-rt builtins library does not currently provide an
implementation for __sfp_handle_exceptions. On x86_64, this causes
unresolved symbol errors when building glibc in environments that
exclude libgcc.
This patch implements __sfp_handle_exceptions specifically for x86_64,
bridging the gap for non-GNU compiler runtimes.
The implementation is used conditionally, only if the compiler does
not already provide the symbol.
NB: the implementation is based on libgcc and raises both SSE and i387
exceptions (different from the one in 460ee50de054396cc9791ff4).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
This commit introduces extensive debug logging for thread-local storage
(TLS) operations within the dynamic linker. When `LD_DEBUG=tls` is
enabled, messages are printed for:
- TLS module assignment and release.
- DTV (Dynamic Thread Vector) resizing events.
- TLS block allocations and deallocations.
- `__tls_get_addr` slow path events (DTV updates, lazy allocations, and
static TLS usage).
The log format is standardized to use a "tls: " prefix and identifies
modules using the "modid %lu" convention. To aid in debugging
multithreaded applications, thread-specific logs include the Thread
Control Block (TCB) address to identify the context of the operation.
A new test module `tst-tls-debug-mod.c` and a corresponding shell script
`tst-tls-debug-recursive.sh` have been added. Additionally, the existing
`tst-dl-debug-tid` NPTL test has been updated to verify these TLS debug
messages in a multithreaded context.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
We need to tie the fast-path read to the store, to make sure that when
fast-reading 1, we see all the effects performed by the init routine.
(We don't need a full barrier; an acquire/release pair is enough.)
Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant
|
|
pthread_mutex_unlock sets __owner_id to NOTRECOVERABLE_ID
Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant
|
|
It is supposed to return an error code, not just -1.
Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant
|
|
sigtimedwait also needs to clean up preemptors and the blocked mask before
returning EAGAIN.
Also add some sigtimedwait testing.
|