IMPORTANT NOTE ABOUT MASK LOAD/STORE:
Broadcast 64-bit integer a to all elements of the return value.
Set packed 64-bit integers with the supplied values.
Set packed 64-bit integers with the supplied values in reverse order.
Add packed double-precision (64-bit) floating-point elements in a and b.
Add packed single-precision (32-bit) floating-point elements in a and b.
Alternately add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b.
Alternately add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b.
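The alternating pattern is easier to see element by element. The following is a minimal scalar sketch of the double-precision variant, not the intrinsic itself; the name model_addsub_pd is hypothetical, and index 0 is assumed to be the lowest element.

```c
#include <stddef.h>

/* Hypothetical scalar model of the 256-bit double-precision add/subtract.
   Even-indexed elements are subtracted (a - b), odd-indexed elements are
   added (a + b). Index 0 is the lowest element of the vector. */
void model_addsub_pd(const double a[4], const double b[4], double dst[4])
{
    for (size_t i = 0; i < 4; ++i)
        dst[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}
```

With a = {10, 20, 30, 40} and b = {1, 2, 3, 4}, the model produces {9, 22, 27, 44}.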
Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b.
Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b.
Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b.
Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8.
Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8.
Blend packed double-precision (64-bit) floating-point elements from a and b using mask.
Blend packed single-precision (32-bit) floating-point elements from a and b using mask.
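The two blend flavours differ only in where the per-element selector comes from. As a rough scalar sketch (function names are hypothetical, shown for the double-precision case): the imm8 form reads bit i of the immediate, while the mask form reads the sign bit of element i of the mask vector. In both, a set selector picks the element from b.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the immediate blend: bit i of imm8 selects b[i],
   a clear bit selects a[i]. */
void model_blend_pd(const double a[4], const double b[4], int imm8, double dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
}

/* Hypothetical model of the variable blend: the most significant (sign) bit
   of mask[i] selects b[i], otherwise a[i]. Only that one bit matters. */
void model_blendv_pd(const double a[4], const double b[4],
                     const double mask[4], double dst[4])
{
    for (int i = 0; i < 4; ++i) {
        uint64_t bits;
        memcpy(&bits, &mask[i], sizeof bits);   /* inspect the raw bit pattern */
        dst[i] = (bits >> 63) ? b[i] : a[i];
    }
}
```

Note that in the variable form a mask element of -0.0 selects b, because only the sign bit is examined.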
Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements. This effectively duplicates the 128-bit vector.
Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements. This effectively duplicates the 128-bit vector.
Broadcast a single-precision (32-bit) floating-point element from memory to all elements.
Cast vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined.
Cast vector of type __m256d to type __m128d; the upper 128 bits of a are lost.
Cast vector of type __m256d to type __m256.
Cast vector of type __m256d to type __m256i.
Cast vector of type __m128 to type __m256; the upper 128 bits of the result are undefined.
Cast vector of type __m256 to type __m128; the upper 128 bits of a are lost.
Cast vector of type __m256 to type __m256d.
Cast vector of type __m256 to type __m256i.
Cast vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined.
Cast vector of type __m256i to type __m256d.
Cast vector of type __m256i to type __m256.
Cast vector of type __m256i to type __m128i; the upper 128 bits of a are lost.
Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements.
Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements.
Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8.
Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8.
Convert packed signed 32-bit integers in a to packed double-precision (64-bit) floating-point elements.
Convert packed signed 32-bit integers in a to packed single-precision (32-bit) floating-point elements.
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers. Follows the current rounding mode.
Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements.
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, using the current rounding mode.
Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements.
Return the lower double-precision (64-bit) floating-point element of a.
Return the lower 32-bit integer in a.
Return the lower single-precision (32-bit) floating-point element of a.
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation.
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation.
Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b.
Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b.
Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum using the low 4 bits of imm8.
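The two halves of imm8 play different roles, which a scalar sketch makes concrete. The model below is hypothetical (model_dp_ps is not a library function) and assumes the 256-bit form applies the same operation independently to each 128-bit lane of four floats. It also sums the products in simple left-to-right order, which may differ from the hardware's summation order in the last bits.

```c
/* Hypothetical scalar model of the conditional dot product.
   High 4 bits of imm8: which products a[i]*b[i] take part in the sum.
   Low 4 bits of imm8:  which destination elements receive the sum
                        (the others are set to zero).
   Each group of 4 floats (one 128-bit lane) is processed on its own. */
void model_dp_ps(const float a[8], const float b[8], int imm8, float dst[8])
{
    for (int lane = 0; lane < 8; lane += 4) {
        float sum = 0.0f;
        for (int i = 0; i < 4; ++i)
            if ((imm8 >> (4 + i)) & 1)
                sum += a[lane + i] * b[lane + i];
        for (int i = 0; i < 4; ++i)
            dst[lane + i] = ((imm8 >> i) & 1) ? sum : 0.0f;
    }
}
```

For example, imm8 = 0x71 multiplies elements 0 to 2 of each lane and writes the sum to element 0 of that lane only.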
Extract a 32-bit integer from a, selected with imm8.
Extract a 64-bit integer from a, selected with index.
Extract a 128-bit lane from a, selected with index (0 or 1). Note: _mm256_extractf128_pd!0 is equivalent to _mm256_castpd256_pd128.
Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements.
Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements.
Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b.
Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b.
Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b.
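"Adjacent pairs" hides the fact that results from a and b are interleaved per 128-bit lane. A minimal scalar sketch of the double-precision forms, using hypothetical function names and index 0 as the lowest element:

```c
/* Hypothetical scalar model of the 256-bit horizontal add.
   Within each 128-bit lane, the pair from a comes first, then the pair from b. */
void model_hadd_pd(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[0] + a[1];   /* low lane, pair from a */
    dst[1] = b[0] + b[1];   /* low lane, pair from b */
    dst[2] = a[2] + a[3];   /* high lane, pair from a */
    dst[3] = b[2] + b[3];   /* high lane, pair from b */
}

/* Hypothetical scalar model of the 256-bit horizontal subtract:
   same layout, computing (even element - odd element) for each pair. */
void model_hsub_pd(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[0] - a[1];
    dst[1] = b[0] - b[1];
    dst[2] = a[2] - a[3];
    dst[3] = b[2] - b[3];
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the add model gives {3, 30, 7, 70} and the subtract model gives {-1, -10, -1, -10}.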
Copy a, and insert the 16-bit integer i into the result at the location specified by index & 15.
Copy a, and insert the 32-bit integer i into the result at the location specified by index & 7.
Copy a, and insert the 64-bit integer i into the result at the location specified by index & 3.
Copy a, and insert the 8-bit integer i into the result at the location specified by index & 31.
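The "index & N" wording in the insert descriptions above means the index wraps instead of going out of range. A small scalar sketch of the 32-bit case (model_insert_epi32 is a hypothetical name; the other widths would follow the same pattern with masks 15, 3 and 31):

```c
#include <stdint.h>

/* Hypothetical scalar model of inserting a 32-bit integer into a vector of
   eight. The source vector is copied, then one element is overwritten at
   position (index & 7), so out-of-range indices wrap around. */
void model_insert_epi32(const int32_t a[8], int32_t i, int index, int32_t dst[8])
{
    for (int k = 0; k < 8; ++k)
        dst[k] = a[k];
    dst[index & 7] = i;
}
```

For instance, an index of 9 writes to element 1, since 9 & 7 is 1.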
Copy a, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b at the location specified by imm8.
Copy a, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b at the location specified by imm8.
Copy a, then insert 128 bits from b at the location specified by imm8.
Load 256-bits of integer data from unaligned memory into dst. This intrinsic may run better than _mm256_loadu_si256 when the data crosses a cache line boundary.
Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Load 256-bits of integer data from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value. hiaddr and loaddr do not need to be aligned on any particular boundary.
Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value. hiaddr and loaddr do not need to be aligned on any particular boundary.
Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value. hiaddr and loaddr do not need to be aligned on any particular boundary.
Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory. mem_addr does not need to be aligned on any particular boundary.
Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory. mem_addr does not need to be aligned on any particular boundary.
Load 256-bits of integer data from memory. mem_addr does not need to be aligned on any particular boundary.
Load packed double-precision (64-bit) floating-point elements from memory using mask (elements are zeroed out when the high bit of the corresponding element is not set). See: "Note about mask load/store" to know why you must address valid memory only.
Load packed single-precision (32-bit) floating-point elements from memory using mask (elements are zeroed out when the high bit of the corresponding element is not set). Note: emulating that instruction isn't efficient, since it needs to perform memory access only when needed. See: "Note about mask load/store" to know why you must address valid memory only.
Store packed double-precision (64-bit) floating-point elements from a into memory using mask. See: "Note about mask load/store" to know why you must address valid memory only.
Store packed single-precision (32-bit) floating-point elements from a into memory using mask. See: "Note about mask load/store" to know why you must address valid memory only.
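As a rough illustration of the selection rule used by the masked loads and stores above, here is a scalar sketch for the double-precision case. The function names are hypothetical, and the mask is modelled as four signed 64-bit integers so that "high bit set" is the same as "negative". The model only touches selected elements, but that says nothing about what a real implementation may access, so the referenced note about addressing valid memory still applies.

```c
#include <stdint.h>

/* Hypothetical scalar model of a masked load: an element is read from memory
   only when the high (sign) bit of its mask element is set; otherwise the
   result element is zero. */
void model_maskload_pd(const double *mem, const int64_t mask[4], double dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = (mask[i] < 0) ? mem[i] : 0.0;
}

/* Hypothetical scalar model of a masked store: only selected elements are
   written; the other memory locations keep their previous contents. */
void model_maskstore_pd(double *mem, const int64_t mask[4], const double a[4])
{
    for (int i = 0; i < 4; ++i)
        if (mask[i] < 0)
            mem[i] = a[i];
}
```

A mask element of 1 does not select anything, because only the top bit counts.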
Compare packed double-precision (64-bit) floating-point elements in a and b, and return packed maximum values.
Compare packed single-precision (32-bit) floating-point elements in a and b, and return packed maximum values.
Compare packed double-precision (64-bit) floating-point elements in a and b, and return packed minimum values.
Compare packed single-precision (32-bit) floating-point elements in a and b, and return packed minimum values.
Duplicate even-indexed double-precision (64-bit) floating-point elements from a.
Duplicate odd-indexed single-precision (32-bit) floating-point elements from a.
Duplicate even-indexed single-precision (32-bit) floating-point elements from a.
Set each bit of result mask based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.
Set each bit of result mask based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.
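A scalar sketch of the sign-bit gathering described above, for the double-precision case. The name model_movemask_pd is hypothetical; bit i of the result is assumed to correspond to element i, with element 0 the lowest.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model: collect the most significant (sign) bit of each
   of the four doubles into the low four bits of an int. */
int model_movemask_pd(const double a[4])
{
    int mask = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t bits;
        memcpy(&bits, &a[i], sizeof bits);   /* raw bit pattern of the double */
        mask |= (int)(bits >> 63) << i;
    }
    return mask;
}
```

Because only the sign bit is read, -0.0 contributes a set bit just like any negative number.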
Multiply packed double-precision (64-bit) floating-point elements in a and b.
Multiply packed single-precision (32-bit) floating-point elements in a and b.
Compute the bitwise NOT of 256 bits in a. #BONUS
Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b.
Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b.
Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8.
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8. The same shuffle is applied in lower and higher 128-bit lane.
Shuffle double-precision (64-bit) floating-point elements in a using the control in b. Warning: the selector is in bit 1, not bit 0, of each 64-bit element! This is really not intuitive.
Shuffle single-precision (32-bit) floating-point elements in a using the control in b.
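The bit-1 warning for the double-precision variable shuffle is worth a concrete sketch. The model below is hypothetical (model_permutevar_pd is an illustrative name) and assumes the 256-bit form selects within each 128-bit lane, so each result element can only come from one of the two doubles in its own lane.

```c
#include <stdint.h>

/* Hypothetical scalar model of the variable double-precision shuffle.
   For result element i, bit 1 (not bit 0) of control element b[i] chooses
   between the low and high double of the same 128-bit lane of a. */
void model_permutevar_pd(const double a[4], const int64_t b[4], double dst[4])
{
    for (int i = 0; i < 4; ++i) {
        int lane_base = (i / 2) * 2;                   /* 0 for low lane, 2 for high */
        int sel = (int)(((uint64_t)b[i] >> 1) & 1);    /* selector lives in bit 1 */
        dst[i] = a[lane_base + sel];
    }
}
```

With a = {10, 11, 12, 13} and b = {2, 0, 1, 3}, the model yields {11, 10, 12, 13}: a control value of 1 still selects the low element, since bit 1 is clear.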
Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a. The maximum relative error for this approximation is less than 1.5*2^-12.
Round the packed double-precision (64-bit) floating-point elements in a using the rounding parameter, and store the results as packed double-precision floating-point elements. Rounding is done according to the rounding[3:0] parameter, which can be one of:
    (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
    (_MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC)     // round down, and suppress exceptions
    (_MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC)     // round up, and suppress exceptions
    (_MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
    _MM_FROUND_CUR_DIRECTION                        // use MXCSR.RC; see _MM_SET_ROUNDING_MODE
Round the packed single-precision (32-bit) floating-point elements in a using the rounding parameter, and store the results as packed single-precision floating-point elements. Rounding is done according to the rounding[3:0] parameter, which can be one of:
    (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
    (_MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC)     // round down, and suppress exceptions
    (_MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC)     // round up, and suppress exceptions
    (_MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
    _MM_FROUND_CUR_DIRECTION                        // use MXCSR.RC; see _MM_SET_ROUNDING_MODE
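To compare what the four explicit rounding modes do to the same input, here is a plain-C sketch. Everything in it is an assumption made for illustration: the enum names stand in for the _MM_FROUND_* combinations, the _MM_FROUND_CUR_DIRECTION case is not modelled, "nearest" is taken to mean ties-to-even, and the helper is not bit-exact (it ignores NaN, the sign of zero, and values too large for a long long).

```c
/* Hypothetical mode names standing in for the four explicit modes. */
enum model_round_mode {
    MODEL_ROUND_NEAREST,   /* round to nearest, ties to even */
    MODEL_ROUND_DOWN,      /* toward negative infinity */
    MODEL_ROUND_UP,        /* toward positive infinity */
    MODEL_ROUND_TRUNC      /* toward zero */
};

/* Round one value. Only meaningful while |x| fits in a long long. */
static double model_round_one(double x, enum model_round_mode mode)
{
    double t  = (double)(long long)x;        /* truncation toward zero */
    double lo = (t > x) ? t - 1.0 : t;       /* floor */
    double hi = (t < x) ? t + 1.0 : t;       /* ceiling */
    double frac = x - lo;                    /* distance above floor, in [0, 1) */

    switch (mode) {
    case MODEL_ROUND_DOWN:  return lo;
    case MODEL_ROUND_UP:    return hi;
    case MODEL_ROUND_TRUNC: return t;
    default:
        if (frac < 0.5) return lo;
        if (frac > 0.5) return hi;
        /* exact tie: choose the even neighbour */
        return ((long long)lo % 2 == 0) ? lo : hi;
    }
}

/* Hypothetical scalar model of rounding four packed doubles. */
void model_round_pd(const double a[4], enum model_round_mode mode, double dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = model_round_one(a[i], mode);
}
```

For the input {2.5, -2.5, 1.2, -1.2}, the model gives {2, -2, 1, -1} for nearest, {2, -3, 1, -2} for down, {3, -2, 2, -1} for up, and {2, -2, 1, -1} for truncation.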
Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a. The maximum relative error for this approximation is less than 1.5*2^-12.
Broadcast 16-bit integer a to all elements of the return value.
Broadcast 32-bit integer a to all elements.
Broadcast 64-bit integer a to all elements of the return value.
Broadcast 8-bit integer a to all elements of the return value.
Broadcast double-precision (64-bit) floating-point value a to all elements of the return value.
Broadcast single-precision (32-bit) floating-point value a to all elements of the return value.
Set packed 16-bit integers with the supplied values.
Set packed 32-bit integers with the supplied values.
Set packed 64-bit integers with the supplied values.
Set packed 8-bit integers with the supplied values.
Set packed __m256 vector with the supplied values.
Set packed __m256d vector with the supplied values.
Set packed __m256i vector with the supplied values.
Set packed double-precision (64-bit) floating-point elements with the supplied values.
Set packed single-precision (32-bit) floating-point elements with the supplied values.
Set packed 16-bit integers with the supplied values in reverse order.
Set packed 32-bit integers with the supplied values in reverse order.
Set packed 64-bit integers with the supplied values in reverse order.
Set packed 8-bit integers with the supplied values in reverse order.
Set packed __m256 vector with the supplied values.
Set packed __m256d vector with the supplied values.
Set packed __m256i vector with the supplied values.
Set packed double-precision (64-bit) floating-point elements with the supplied values in reverse order.
Set packed single-precision (32-bit) floating-point elements with the supplied values in reverse order.
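The difference between the plain and "reverse order" setters is only which argument lands in element 0. A minimal scalar sketch for four doubles, with hypothetical function names and element 0 taken as the lowest element of the vector:

```c
/* Hypothetical scalar model of the plain setter: arguments are given from the
   highest element down to the lowest, so the LAST argument becomes element 0. */
void model_set_pd(double e3, double e2, double e1, double e0, double dst[4])
{
    dst[0] = e0;
    dst[1] = e1;
    dst[2] = e2;
    dst[3] = e3;
}

/* Hypothetical scalar model of the reverse-order setter: arguments are given
   in memory order, so the FIRST argument becomes element 0. */
void model_setr_pd(double e0, double e1, double e2, double e3, double dst[4])
{
    dst[0] = e0;
    dst[1] = e1;
    dst[2] = e2;
    dst[3] = e3;
}
```

Calling both models with the arguments (1, 2, 3, 4) gives {4, 3, 2, 1} for the plain setter and {1, 2, 3, 4} for the reverse-order one.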
Return vector of type __m256d with all elements set to zero.
Return vector of type __m256 with all elements set to zero.
Return vector of type __m256i with all elements set to zero.
Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8.
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8.
Compute the square root of packed double-precision (64-bit) floating-point elements in a.
Compute the square root of packed single-precision (32-bit) floating-point elements in a.
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Store the high and low 128-bit halves (each composed of integer data) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. Note: there isn't any particular instruction in AVX to do that. It just defers to SSE2. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a.
Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a.
Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and return 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise return 0.
Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and return 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise return 0.
Compute the bitwise NOT of a and then AND with b, and return 1 if the result is zero, otherwise return 0. In other words, test if all bits masked by b are also 1 in a.
Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and return 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise return 0. In other words, return 1 if a and b don't both have a negative number at the same place.
Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and return 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise return 0. In other words, return 1 if a and b don't both have a negative number at the same place.
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and return 1 if the result is zero, otherwise return 0. In other words, test if all bits masked by b are 0 in a.
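The ZF/CF wording in the test descriptions above can be hard to keep straight, so here is a scalar sketch of the three whole-register integer tests. All names are hypothetical, and the 256-bit value is modelled as four 64-bit words.

```c
#include <stdint.h>

/* ZF as described above: 1 when (a AND b) has no bit set. */
static int model_zf(const uint64_t a[4], const uint64_t b[4])
{
    uint64_t acc = 0;
    for (int i = 0; i < 4; ++i)
        acc |= a[i] & b[i];
    return acc == 0;
}

/* CF as described above: 1 when ((NOT a) AND b) has no bit set. */
static int model_cf(const uint64_t a[4], const uint64_t b[4])
{
    uint64_t acc = 0;
    for (int i = 0; i < 4; ++i)
        acc |= ~a[i] & b[i];
    return acc == 0;
}

/* Hypothetical models of the three integer tests. */
int model_testz_si256(const uint64_t a[4], const uint64_t b[4])
{
    return model_zf(a, b);                     /* all bits masked by b are 0 in a */
}

int model_testc_si256(const uint64_t a[4], const uint64_t b[4])
{
    return model_cf(a, b);                     /* all bits masked by b are 1 in a */
}

int model_testnzc_si256(const uint64_t a[4], const uint64_t b[4])
{
    return !model_zf(a, b) && !model_cf(a, b); /* some masked bits are 1, some are 0 */
}
```

The floating-point variants follow the same shape, except that, per their descriptions, only the sign bit of each element is inspected.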
Return vector of type __m256d with undefined elements.
Return vector of type __m256 with undefined elements.
Return vector of type __m256i with undefined elements.
Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b.
Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b.
Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b.
Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b.
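"High half of each 128-bit lane" is the part that tends to surprise: the unpack operations do not cross lanes. A scalar sketch of the double-precision forms, using hypothetical function names and element 0 as the lowest element:

```c
/* Hypothetical scalar model of the high unpack: from each 128-bit lane
   (two doubles), take the high element of a, then the high element of b. */
void model_unpackhi_pd(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[1];
    dst[1] = b[1];
    dst[2] = a[3];
    dst[3] = b[3];
}

/* Hypothetical scalar model of the low unpack: same interleaving, but using
   the low element of each lane. */
void model_unpacklo_pd(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[0];
    dst[1] = b[0];
    dst[2] = a[2];
    dst[3] = b[2];
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the high model gives {2, 20, 4, 40} and the low model gives {1, 10, 3, 30}.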
Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b.
Cast vector of type __m128d to type __m256d; the upper 128 bits of the result are zeroed.
Cast vector of type __m128 to type __m256; the upper 128 bits of the result are zeroed.
Cast vector of type __m128i to type __m256i; the upper 128 bits of the result are zeroed.
Broadcast a single-precision (32-bit) floating-point element from memory to all elements.
Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8.
Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8.
Compare the lower double-precision (64-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of result, and copy the upper element from a to the upper element of result.
Compare the lower single-precision (32-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of result, and copy the upper 3 packed elements from a to the upper elements of result.
Load packed double-precision (64-bit) floating-point elements from memory using mask (elements are zeroed out when the high bit of the corresponding element is not set). Note: emulating that instruction isn't efficient, since it needs to perform memory access only when needed. See: "Note about mask load/store" to know why you must address valid memory only.
Load packed single-precision (32-bit) floating-point elements from memory using mask (elements are zeroed out when the high bit of the corresponding element is not set). Warning: See "Note about mask load/store" to know why you must address valid memory only.
Store packed double-precision (64-bit) floating-point elements from a into memory using mask. Note: emulating that instruction isn't efficient, since it needs to perform memory access only when needed. See: "Note about mask load/store" to know why you must address valid memory only.
Store packed single-precision (32-bit) floating-point elements from a into memory using mask. Note: emulating that instruction isn't efficient, since it needs to perform memory access only when needed. See: "Note about mask load/store" to know why you must address valid memory only.
Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8.
Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8.
Shuffle double-precision (64-bit) floating-point elements in a using the control in b. Warning: the selector is in bit 1, not bit 0, of each 64-bit element! This is really not intuitive.
Shuffle single-precision (32-bit) floating-point elements in a using the control in b.
Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and return 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise return 0.
Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and return 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise return 0.
Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and return 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise return 0. In other words, return 1 if a and b don't both have a negative number at the same place.
Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and return 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise return 0. In other words, return 1 if a and b don't both have a negative number at the same place.
AVX intrinsics. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX