* initial gemma4 support
* parity with reference implementation
outputs can 100% match transformers with same sdpa flags, checkpoint this and then optimize
* Cleanup, video fixes
* cleanup, enable fused rms norm by default
* update comment
* Cleanup
* Update sd.py
* Various fixes
* Add fp8 scaled embedding support
* small fixes
* Translate think tokens
* Fix image encoder attention mask type
So it works with basic attention
* Handle thinking tokens different only for Gemma4
* Code cleanup
* Update nodes_textgen.py
* Use embed scale class instead of buffer
Slight difference to HF, but technically more accurate and simpler code
* Default to fused rms_norm
* Update gemma4.py
* feat(api-nodes): add Topaz Astra 2 model
Signed-off-by: bigcat88 <bigcat88@icloud.com>
* feat(api-nodes): make Astra 2 the default Topaz upscaler model
Reorder UPSCALER_MODELS_MAP and the upscaler_model dynamic combo so
"Astra 2" appears first, surfacing it as the default selection.
---------
Signed-off-by: bigcat88 <bigcat88@icloud.com>
Co-authored-by: Marwan Mostafa <marawan206@gmail.com>
* mm: Use Aimdo raw allocator for cast buffers
pytorch manages allocation of growing buffers on streams poorly. Pyt
has no windows support for the expandable segments allocator (which is
the right tool for this job), while also segmenting the memory by
stream such that it can be generally re-used. So kick the problem to
aimdo which can just grow a virtual region thats freed per stream.
* plan
* ops: move cpu handler up to the caller
* ops: split up prefetch from weight prep block prefetching API
Split up the casting and weight formating/lora stuff in prep for
arbitrary prefetch support.
* ops: implement block prefetching API
allow a model to construct a prefetch list and operate it for increased
async offload.
* ltxv2: Implement block prefetching
* Implement lora async offload
Implement async offload of loras.
* pinned_memory: remove JIT RAM pressure release
This doesn't work, as freeing intermediates for pins needs to be
higher-priority than freeing pins-for-pins if and when you are going
to do that. So this is too late as pins-for-pins is model load time
and we dont have JIT pins-for-pins.
* cacheing: Add a filter to only free intermediates from inactive wfs
This is to get priorities in amongst pins straight.
* mm: free inactive-ram from RAM cache first
Stuff from inactive workflows should be freed before anything else.
* caching: purge old ModelPatchers first
Dont try and score them, just dump them at the first sign of trouble
if they arent part of the workflow.
On failure (ex: animated webp files) fallback to old pillow code.
This should fix the extra precision in high bit depth images (like 16 bit PNG) being discarded when loaded by Pillow and potentially add support for more image formats.