Shift a left by bytes bytes while shifting in zeros.
Shift v right by bytes bytes while shifting in zeros.
Add packed 16-bit integers in a and b.
Add packed 32-bit integers in a and b.
Add packed 64-bit integers in a and b.
Add packed 8-bit integers in a and b.
Add packed double-precision (64-bit) floating-point elements in a and b.
Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.
Add 64-bit integers a and b.
Add packed 16-bit integers in a and b using signed saturation.
Add packed 8-bit signed integers in a and b using signed saturation.
Add packed unsigned 16-bit integers in a and b using unsigned saturation.
Add packed 8-bit unsigned integers in a and b using unsigned saturation.
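The difference between the plain and saturating adds above can be seen in a small sketch. It is written in C against Intel's `<emmintrin.h>` and assumes an x86 target with SSE2; the helper names `add_wrap_u8` and `add_sat_u8` are hypothetical and exist only for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds x and y in every 8-bit lane with wraparound and returns lane 0. */
static uint8_t add_wrap_u8(uint8_t x, uint8_t y) {
    __m128i r = _mm_add_epi8(_mm_set1_epi8((char)x), _mm_set1_epi8((char)y));
    return (uint8_t)_mm_cvtsi128_si32(r);
}

/* Same addition, but clamped to 255 by unsigned saturation. */
static uint8_t add_sat_u8(uint8_t x, uint8_t y) {
    __m128i r = _mm_adds_epu8(_mm_set1_epi8((char)x), _mm_set1_epi8((char)y));
    return (uint8_t)_mm_cvtsi128_si32(r);
}
```

With inputs 200 and 100, the wrapping add yields 44 (300 modulo 256) while the saturating add yields 255.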
Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise AND of 128 bits (representing integer data) in a and b.
Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b.
Compute the bitwise NOT of 128 bits (representing integer data) in a and then AND with b.
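A common use of the AND / AND-NOT pair is a branchless per-lane select driven by a comparison mask. The sketch below is in C against `<emmintrin.h>`, assuming an x86 target with SSE2; `max4_i32` is a hypothetical helper, and it relies on the 32-bit greater-than comparison and bitwise OR that appear elsewhere in this list.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Writes the per-lane maximum of four signed 32-bit integers to out. */
static void max4_i32(const int32_t *x, const int32_t *y, int32_t *out) {
    __m128i a  = _mm_loadu_si128((const __m128i *)x);
    __m128i b  = _mm_loadu_si128((const __m128i *)y);
    __m128i gt = _mm_cmpgt_epi32(a, b);            /* all ones where a > b, else zero */
    /* (gt & a) keeps a where a > b; (~gt & b) keeps b everywhere else. */
    __m128i r  = _mm_or_si128(_mm_and_si128(gt, a), _mm_andnot_si128(gt, b));
    _mm_storeu_si128((__m128i *)out, r);
}
```

Note the operand order of `_mm_andnot_si128`: the first argument is the one that gets inverted.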
Average packed unsigned 16-bit integers in a and b.
Average packed unsigned 8-bit integers in a and b.
Cast vector of type __m128d to type __m128. Note: Also possible with a regular cast(__m128)(a).
Cast vector of type __m128d to type __m128i. Note: Also possible with a regular cast(__m128i)(a).
Cast vector of type __m128 to type __m128d. Note: Also possible with a regular cast(__m128d)(a).
Cast vector of type __m128 to type __m128i. Note: Also possible with a regular cast(__m128i)(a).
Cast vector of type __m128i to type __m128d. Note: Also possible with a regular cast(__m128d)(a).
Cast vector of type __m128i to type __m128. Note: Also possible with a regular cast(__m128)(a).
Invalidate and flush the cache line that contains p from all levels of the cache hierarchy.
Compare packed 16-bit integers in a and b for equality.
Compare packed 32-bit integers in a and b for equality.
Compare packed 8-bit integers in a and b for equality.
Compare packed double-precision (64-bit) floating-point elements in a and b for equality.
Compare the lower double-precision (64-bit) floating-point elements in a and b for equality, store the result in the lower element, and copy the upper element from a.
Compare packed 16-bit integer elements in a and b for greater-than-or-equal. #BONUS
Compare packed double-precision (64-bit) floating-point elements in a and b for greater-than-or-equal.
Compare the lower double-precision (64-bit) floating-point elements in a and b for greater-than-or-equal, store the result in the lower element, and copy the upper element from a.
Compare packed 16-bit integers in a and b for greater-than.
Compare packed 32-bit integers in a and b for greater-than.
Compare packed 8-bit integers in a and b for greater-than.
Compare packed double-precision (64-bit) floating-point elements in a and b for greater-than.
Compare the lower double-precision (64-bit) floating-point elements in a and b for greater-than, store the result in the lower element, and copy the upper element from a.
Compare packed 16-bit integer elements in a and b for less-than-or-equal. #BONUS
Compare packed double-precision (64-bit) floating-point elements in a and b for less-than-or-equal.
Compare the lower double-precision (64-bit) floating-point elements in a and b for less-than-or-equal, store the result in the lower element, and copy the upper element from a.
Compare packed 16-bit integers in a and b for less-than.
Compare packed 32-bit integers in a and b for less-than.
Compare packed 8-bit integers in a and b for less-than.
Compare packed double-precision (64-bit) floating-point elements in a and b for less-than.
Compare the lower double-precision (64-bit) floating-point elements in a and b for less-than, store the result in the lower element, and copy the upper element from a.
Compare packed double-precision (64-bit) floating-point elements in a and b for not-equal.
Compare the lower double-precision (64-bit) floating-point elements in a and b for not-equal, store the result in the lower element, and copy the upper element from a.
Compare packed double-precision (64-bit) floating-point elements in a and b for not-greater-than-or-equal.
Compare the lower double-precision (64-bit) floating-point elements in a and b for not-greater-than-or-equal, store the result in the lower element, and copy the upper element from a.
Compare packed double-precision (64-bit) floating-point elements in a and b for not-greater-than.
Compare the lower double-precision (64-bit) floating-point elements in a and b for not-greater-than, store the result in the lower element, and copy the upper element from a.
Compare packed double-precision (64-bit) floating-point elements in a and b for not-less-than-or-equal.
Compare the lower double-precision (64-bit) floating-point elements in a and b for not-less-than-or-equal, store the result in the lower element, and copy the upper element from a.
Compare packed double-precision (64-bit) floating-point elements in a and b for not-less-than.
Compare the lower double-precision (64-bit) floating-point elements in a and b for not-less-than, store the result in the lower element, and copy the upper element from a.
Compare packed double-precision (64-bit) floating-point elements in a and b to see if neither is NaN.
Compare the lower double-precision (64-bit) floating-point elements in a and b to see if neither is NaN, store the result in the lower element, and copy the upper element from a to the upper element.
Compare packed double-precision (64-bit) floating-point elements in a and b to see if either is NaN.
Compare the lower double-precision (64-bit) floating-point elements in a and b to see if either is NaN, store the result in the lower element, and copy the upper element from a to the upper element.
Compare the lower double-precision (64-bit) floating-point element in a and b for equality, and return the boolean result (0 or 1).
Compare the lower double-precision (64-bit) floating-point element in a and b for greater-than-or-equal, and return the boolean result (0 or 1).
Compare the lower double-precision (64-bit) floating-point element in a and b for greater-than, and return the boolean result (0 or 1).
Compare the lower double-precision (64-bit) floating-point element in a and b for less-than-or-equal, and return the boolean result (0 or 1).
Compare the lower double-precision (64-bit) floating-point element in a and b for less-than, and return the boolean result (0 or 1).
Compare the lower double-precision (64-bit) floating-point element in a and b for not-equal, and return the boolean result (0 or 1).
Convert packed 32-bit integers in a to packed double-precision (64-bit) floating-point elements.
Convert packed 32-bit integers in a to packed single-precision (32-bit) floating-point elements.
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers.
Convert packed double-precision (64-bit) floating-point elements in v to packed 32-bit integers.
Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements.
Convert packed 32-bit integers in v to packed double-precision (64-bit) floating-point elements.
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers.
Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements.
Copy the lower double-precision (64-bit) floating-point element of a.
Convert the lower double-precision (64-bit) floating-point element in a to a 32-bit integer.
Convert the lower double-precision (64-bit) floating-point element in a to a 64-bit integer.
Convert the lower double-precision (64-bit) floating-point element in b to a single-precision (32-bit) floating-point element, store that in the lower element of result, and copy the upper 3 packed elements from a to the upper elements of result.
Get the lower 32-bit integer in a.
Get the lower 64-bit integer in a.
Convert the signed 32-bit integer b to a double-precision (64-bit) floating-point element, store that in the lower element of result, and copy the upper element from a to the upper element of result.
Copy 32-bit integer a to the lower element of result, and zero the upper elements.
Convert the signed 64-bit integer b to a double-precision (64-bit) floating-point element, store the result in the lower element of result, and copy the upper element from a to the upper element of result.
Copy 64-bit integer a to the lower element of result, and zero the upper element.
Convert the lower single-precision (32-bit) floating-point element in b to a double-precision (64-bit) floating-point element, store that in the lower element of result, and copy the upper element from a to the upper element of result.
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation. Put zeroes in the upper elements of result.
Convert packed double-precision (64-bit) floating-point elements in v to packed 32-bit integers with truncation.
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation.
Convert the lower double-precision (64-bit) floating-point element in a to a 32-bit integer with truncation.
Convert the lower double-precision (64-bit) floating-point element in a to a 64-bit integer with truncation.
Convert the lower double-precision (64-bit) floating-point element in a to a 64-bit integer with truncation.
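The plain and "with truncation" scalar conversions above differ only in how the fractional part is handled. The following sketch is in C against `<emmintrin.h>`, assuming an x86 target with SSE2 and the default MXCSR rounding mode (round to nearest, ties to even); the helper names are hypothetical.

```c
#include <emmintrin.h>

/* Converts using the current MXCSR rounding mode
   (assumed here to be the default, round-to-nearest-even). */
static int to_int_rounded(double x) {
    return _mm_cvtsd_si32(_mm_set_sd(x));
}

/* Converts with truncation toward zero, independent of the rounding mode. */
static int to_int_truncated(double x) {
    return _mm_cvttsd_si32(_mm_set_sd(x));
}
```

Under that assumption, 2.7 converts to 3 with the first helper and to 2 with the second, and 2.5 rounds to the even value 2.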
Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b.
Extract a 16-bit integer from v, selected with index. Warning: the returned value is zero-extended to 32-bits.
Copy v, and insert the 16-bit integer i at the location specified by index.
Perform a serializing operation on all load-from-memory instructions that were issued prior to this instruction. Guarantees that every load instruction that precedes, in program order, is globally visible before any load instruction which follows the fence in program order.
Load 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Load a double-precision (64-bit) floating-point element from memory into both elements of dst. mem_addr does not need to be aligned on any particular boundary.
Load a double-precision (64-bit) floating-point element from memory into the lower element of result, and zero the upper element. mem_addr does not need to be aligned on any particular boundary.
Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Load a double-precision (64-bit) floating-point element from memory into the upper element of result, and copy the lower element from a to result. mem_addr does not need to be aligned on any particular boundary.
Load 64-bit integer from memory into the first element of result, and zero the other element. Note: the signature is unusual, since the memory doesn't have to be aligned and should point to 64 addressable bits, not 128. You may use _mm_loadu_si64 instead.
Load a double-precision (64-bit) floating-point element from memory into the lower element of result, and copy the upper element from a to result. mem_addr does not need to be aligned on any particular boundary.
Load 2 double-precision (64-bit) floating-point elements from memory into result in reverse order. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Load 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from memory. mem_addr does not need to be aligned on any particular boundary.
Load 128-bits of integer data from memory. mem_addr does not need to be aligned on any particular boundary.
Load unaligned 16-bit integer from memory into the first element, fill with zeroes otherwise.
Load unaligned 32-bit integer from memory into the first element of result.
Load unaligned 64-bit integer from memory into the first element of result. The upper 64 bits are zeroed.
Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in destination.
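The multiply-and-add-adjacent-pairs operation described above is a natural building block for a 16-bit dot product. Here is a minimal sketch in C against `<emmintrin.h>`, assuming an x86 target with SSE2; `dot8_i16` is a hypothetical helper, and the horizontal reduction after the multiply is one possible approach among several.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Dot product of two arrays of eight signed 16-bit integers. */
static int32_t dot8_i16(const int16_t *x, const int16_t *y) {
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    /* Four 32-bit lanes: x0*y0 + x1*y1, x2*y2 + x3*y3, and so on. */
    __m128i p = _mm_madd_epi16(a, b);
    /* Horizontal sum of the four lanes using whole-register byte shifts. */
    p = _mm_add_epi32(p, _mm_srli_si128(p, 8));
    p = _mm_add_epi32(p, _mm_srli_si128(p, 4));
    return _mm_cvtsi128_si32(p);
}
```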
Conditionally store 8-bit integer elements from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element) and a non-temporal memory hint. mem_addr does not need to be aligned on any particular boundary.
Compare packed signed 16-bit integers in a and b, and return packed maximum values.
Compare packed unsigned 8-bit integers in a and b, and return packed maximum values.
Compare packed double-precision (64-bit) floating-point elements in a and b, and return packed maximum values.
Compare the lower double-precision (64-bit) floating-point elements in a and b, store the maximum value in the lower element of result, and copy the upper element from a to the upper element of result.
Perform a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to this instruction. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.
Compare packed signed 16-bit integers in a and b, and return packed minimum values.
Compare packed unsigned 8-bit integers in a and b, and return packed minimum values.
Compare packed double-precision (64-bit) floating-point elements in a and b, and return packed minimum values.
Compare the lower double-precision (64-bit) floating-point elements in a and b, store the minimum value in the lower element of result, and copy the upper element from a to the upper element of result.
Copy the lower 64-bit integer in a to the lower element of result, and zero the upper element.
Move the lower double-precision (64-bit) floating-point element from b to the lower element of result, and copy the upper element from a to the upper element of result.
Create mask from the most significant bit of each 16-bit element in v. #BONUS
Create mask from the most significant bit of each 8-bit element in v.
Set each bit of mask result based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in v.
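The mask-creating operations above are typically paired with a packed comparison to turn a vector result into a scalar bitmask. The sketch below, in C against `<emmintrin.h>` and assuming an x86 target with SSE2, searches 16 bytes for a value; `find_byte16` is a hypothetical helper, and the bit scan is a plain loop to stay compiler-independent.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Returns the index of the first byte in buf[0..15] equal to needle, or -1. */
static int find_byte16(const uint8_t *buf, uint8_t needle) {
    __m128i v    = _mm_loadu_si128((const __m128i *)buf);
    __m128i eq   = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)needle));
    int     mask = _mm_movemask_epi8(eq);   /* bit i is set when byte i matched */
    if (mask == 0) return -1;
    int i = 0;
    while (((mask >> i) & 1) == 0) ++i;     /* portable scan for the lowest set bit */
    return i;
}
```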
Copy the lower 64-bit integer in v.
Copy the 64-bit integer a to the lower element of result, and zero the upper element.
Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results.
Multiply packed double-precision (64-bit) floating-point elements in a and b, and return the results.
Multiply the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of result, and copy the upper element from a to the upper element of result.
Multiply the low unsigned 32-bit integers from a and b, and get an unsigned 64-bit result.
Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and return the high 16 bits of the intermediate integers.
Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and return the high 16 bits of the intermediate integers.
Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and return the low 16 bits of the intermediate integers.
Compute the bitwise NOT of 128 bits in a. #BONUS
Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise OR of 128 bits (representing integer data) in a and b.
Convert packed signed 16-bit integers from a and b to packed 8-bit integers using signed saturation.
Convert packed signed 32-bit integers from a and b to packed 16-bit integers using signed saturation.
Convert packed signed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation.
Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.
Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce two unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in result.
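Because the sum-of-absolute-differences operation leaves one partial sum in each 64-bit half, a full 16-byte result needs one extra addition. This is sketched below in C against `<emmintrin.h>`, assuming an x86 target with SSE2; `sad16_u8` is a hypothetical helper.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sum of absolute differences over two arrays of 16 unsigned bytes. */
static unsigned sad16_u8(const uint8_t *x, const uint8_t *y) {
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    /* Two partial sums, each in the low 16 bits of a 64-bit element. */
    __m128i s = _mm_sad_epu8(a, b);
    unsigned lo = (unsigned)_mm_cvtsi128_si32(s);     /* sum over bytes 0..7  */
    unsigned hi = (unsigned)_mm_extract_epi16(s, 4);  /* sum over bytes 8..15 */
    return lo + hi;
}
```

Each partial sum is at most 8 * 255 = 2040, so it always fits in the 16 bits that are extracted.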
Broadcast 16-bit integer a to all elements of dst.
Broadcast 32-bit integer a to all elements.
Broadcast 64-bit integer a to all elements.
Broadcast 64-bit integer a to all elements.
Broadcast 8-bit integer a to all elements.
Set packed 16-bit integers with the supplied values.
Set packed 32-bit integers with the supplied values.
Set packed 64-bit integers with the supplied values.
Set packed 64-bit integers with the supplied values.
Set packed 8-bit integers with the supplied values.
Set packed double-precision (64-bit) floating-point elements with the supplied values.
Broadcast double-precision (64-bit) floating-point value a to all elements.
Copy double-precision (64-bit) floating-point element a to the lower element of result, and zero the upper element.
Set packed 16-bit integers with the supplied values in reverse order.
Set packed 32-bit integers with the supplied values in reverse order.
Set packed 64-bit integers with the supplied values in reverse order.
Set packed 8-bit integers with the supplied values in reverse order.
Set packed double-precision (64-bit) floating-point elements with the supplied values in reverse order.
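The "set" and "set in reverse order" families differ only in which argument ends up in the lowest element. A short sketch in C against `<emmintrin.h>`, assuming an x86 target with SSE2, makes the ordering concrete; both helper names are hypothetical.

```c
#include <emmintrin.h>

/* _mm_set_epi32 takes the highest element first, so the LAST argument
   lands in element 0 (the lowest 32 bits). */
static int lowest_of_set(int w, int x, int y, int z) {
    return _mm_cvtsi128_si32(_mm_set_epi32(w, x, y, z));
}

/* _mm_setr_epi32 takes arguments in memory order, so the FIRST argument
   lands in element 0. */
static int lowest_of_setr(int w, int x, int y, int z) {
    return _mm_cvtsi128_si32(_mm_setr_epi32(w, x, y, z));
}
```

For the arguments (1, 2, 3, 4), the first helper returns 4 and the second returns 1.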
Return vector of type __m128d with all elements set to zero.
Return vector of type __m128i with all elements set to zero.
Shuffle 32-bit integers in a using the control in imm8.
Shuffle double-precision (64-bit) floating-point elements using the control in imm8.
Shuffle 16-bit integers in the high 64 bits of a using the control in imm8. Store the results in the high 64 bits of result, with the low 64 bits being copied from a to result. See also: _MM_SHUFFLE.
Shuffle 16-bit integers in the low 64 bits of a using the control in imm8. Store the results in the low 64 bits of result, with the high 64 bits being copied from a to result.
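The imm8 control for these shuffles is usually built with the _MM_SHUFFLE macro mentioned above. The sketch below, in C against `<emmintrin.h>` and assuming an x86 target with SSE2, reverses four 32-bit integers; `reverse4_i32` is a hypothetical helper.

```c
#include <emmintrin.h>  /* also provides _MM_SHUFFLE via <xmmintrin.h> */
#include <stdint.h>

/* Writes the four 32-bit integers of in to out in reverse order. */
static void reverse4_i32(const int32_t in[4], int32_t out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* _MM_SHUFFLE(d, c, b, a): result lane 0 takes source lane a, lane 1 takes b,
       lane 2 takes c, lane 3 takes d. Here lane 0 <- 3, 1 <- 2, 2 <- 1, 3 <- 0. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}
```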
Shift packed 16-bit integers in a left by count while shifting in zeros.
Shift packed 32-bit integers in a left by count while shifting in zeros.
Shift packed 64-bit integers in a left by count while shifting in zeros.
Shift packed 16-bit integers in a left by imm8 while shifting in zeros.
Shift packed 32-bit integers in a left by imm8 while shifting in zeros.
Shift packed 64-bit integers in a left by imm8 while shifting in zeros.
Shift a left by bytes bytes while shifting in zeros.
Compute the square root of packed double-precision (64-bit) floating-point elements in vec.
Compute the square root of the lower double-precision (64-bit) floating-point element in b, store the result in the lower element of result, and copy the upper element from a to the upper element of result.
Shift packed 16-bit integers in a right by count while shifting in sign bits.
Shift packed 32-bit integers in a right by count while shifting in sign bits.
Shift packed 16-bit integers in a right by imm8 while shifting in sign bits.
Shift packed 32-bit integers in a right by imm8 while shifting in sign bits.
Shift packed 16-bit integers in a right by imm8 while shifting in zeros.
Shift packed 32-bit integers in a right by imm8 while shifting in zeros.
Shift packed 64-bit integers in a right by imm8 while shifting in zeros.
Shift v right by bytes bytes while shifting in zeros. #BONUS
Shift v right by bytes bytes while shifting in zeros. #BONUS
Shift v right by bytes bytes while shifting in zeros.
Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Store the lower double-precision (64-bit) floating-point element from a into 2 contiguous elements in memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Store the lower double-precision (64-bit) floating-point element from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 128-bits of integer data from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Store the upper double-precision (64-bit) floating-point element from a into memory.
Store the lower double-precision (64-bit) floating-point element from a into memory.
Store 2 double-precision (64-bit) floating-point elements from a into memory in reverse order. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 128-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 16-bit integer from the first element of a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 32-bit integer from the first element of a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 64-bit integer from the first element of a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 128-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 32-bit integer a into memory using a non-temporal hint to minimize cache pollution. If the cache line containing address mem_addr is already in the cache, the cache will be updated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 64-bit integer a into memory using a non-temporal hint to minimize cache pollution. If the cache line containing address mem_addr is already in the cache, the cache will be updated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
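The note about following non-temporal stores with _mm_sfence() can be illustrated with a small fill routine. This is a sketch in C against `<emmintrin.h>`, assuming an x86 target with SSE2; `fill_stream` is a hypothetical helper, and it assumes the caller supplies a 16-byte-aligned buffer whose length is a multiple of 4 words.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Fills n 32-bit words with value using non-temporal stores.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
static void fill_stream(int32_t *dst, size_t n, int32_t value) {
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i + 4 <= n; i += 4) {
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* Order the streamed stores before any later stores, such as a
       "data is ready" flag that another thread will read. */
    _mm_sfence();
}
```

An aligned buffer can be obtained, for example, with `_mm_malloc(size, 16)` and released with `_mm_free`.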
Subtract packed 16-bit integers in b from packed 16-bit integers in a.
Subtract packed 32-bit integers in b from packed 32-bit integers in a.
Subtract packed 64-bit integers in b from packed 64-bit integers in a.
Subtract packed 8-bit integers in b from packed 8-bit integers in a.
Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a.
Subtract the lower double-precision (64-bit) floating-point element in b from the lower double-precision (64-bit) floating-point element in a, store that in the lower element of result, and copy the upper element from a to the upper element of result.
Subtract 64-bit integer b from 64-bit integer a.
Subtract packed signed 16-bit integers in b from packed signed 16-bit integers in a using signed saturation.
Subtract packed signed 8-bit integers in b from packed signed 8-bit integers in a using signed saturation.
Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using unsigned saturation.
Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using unsigned saturation.
Return vector of type __m128d with undefined elements.
Return vector of type __m128i with undefined elements.
Unpack and interleave 16-bit integers from the high half of a and b.
Unpack and interleave 32-bit integers from the high half of a and b.
Unpack and interleave 64-bit integers from the high half of a and b.
Unpack and interleave 8-bit integers from the high half of a and b.
Unpack and interleave double-precision (64-bit) floating-point elements from the high half of a and b.
Unpack and interleave 16-bit integers from the low half of a and b.
Unpack and interleave 32-bit integers from the low half of a and b.
Unpack and interleave 64-bit integers from the low half of a and b.
Unpack and interleave 8-bit integers from the low half of a and b.
Unpack and interleave double-precision (64-bit) floating-point elements from the low half of a and b.
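One frequent use of the interleaving operations above is widening: interleaving with a zero vector turns unsigned 8-bit values into 16-bit values. The sketch below is in C against `<emmintrin.h>`, assuming an x86 target with SSE2; `widen_low8_u8_to_u16` is a hypothetical helper.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Zero-extends the first eight bytes of in to eight 16-bit integers. */
static void widen_low8_u8_to_u16(const uint8_t in[16], uint16_t out[8]) {
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    /* Byte layout after interleaving: in0, 0, in1, 0, ... which on
       little-endian x86 reads as the 16-bit values in0, in1, ... */
    __m128i lo = _mm_unpacklo_epi8(v, _mm_setzero_si128());
    _mm_storeu_si128((__m128i *)out, lo);
}
```

The high-half variant applied the same way would widen bytes 8 to 15.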
Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise XOR of 128 bits (representing integer data) in a and b.
SSE2 intrinsics. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE2