Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces the Ming SDK, a Python API that streamlines multimodal interaction with the Ming-Flash-2.0 model. It unifies text, speech, and image generation and understanding, and offers both synchronous and asynchronous operation with streaming support. The changes also bring significant performance improvements for image generation through optimized transformer architectures and a new HTTP service for scalable image processing, alongside comprehensive documentation and monitoring tools.
Code Review
This pull request introduces the Ming-SDK, a comprehensive toolkit for interacting with the Ming-Flash-2.0 model, supporting multimodal functionalities like text, speech, and image generation. The changes include extensive documentation in the README, the full SDK source code under the ming_sdk/ directory, and modifications to existing model and utility files to support new features like Taylor cache acceleration. My review has identified a critical security vulnerability in the new image generation server due to the use of pickle for deserialization. Additionally, I've found several medium to high severity issues, including bugs in documentation examples, potential regressions in utility functions, and maintainability concerns like hardcoded values and commented-out code. Addressing these points will significantly improve the quality, security, and usability of the new SDK.
```python
# with open(filename, "a", encoding="utf-8") as f:
#     f.write(f"{gen_param_b64}\n")

gen_param = base64_to_dict(gen_param_b64)
```
Using pickle.loads to deserialize data received from a network request is a critical security vulnerability that can lead to Remote Code Execution (RCE). An attacker could craft a malicious payload that executes arbitrary code on the server when unpickled. Please use a safe serialization format like JSON for data exchange over the network.
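A minimal sketch of this suggestion, swapping `pickle` for JSON; the helper names (`base64_to_dict`, `gen_param_b64`) mirror the snippet above, and the payload contents are made up for illustration:

```python
import base64
import json

def dict_to_base64(params: dict) -> str:
    """Serialize a parameter dict to a base64-encoded JSON string."""
    return base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")

def base64_to_dict(gen_param_b64: str) -> dict:
    """Safely decode a base64 JSON payload; json.loads cannot execute code,
    unlike pickle.loads on attacker-controlled bytes."""
    return json.loads(base64.b64decode(gen_param_b64))

gen_param = base64_to_dict(dict_to_base64({"steps": 30, "cfg": 7.5}))
```

Malformed or malicious input then fails with a `JSONDecodeError` instead of executing arbitrary code during deserialization.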
```diff
 max_frames = max(
     1,
     floor_by_factor(
         ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR
     ),
 )
 if "nframes" in ele:
     nframes = min(total_frames, round_by_factor(ele["nframes"], FRAME_FACTOR), max_frames)
 else:
     fps = ele.get("max_video_fps", FPS)
-    nframes = total_frames / video_fps * fps
+    nframes = max(1, total_frames / video_fps * fps)
     if nframes > total_frames:
         logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
-nframes = min(min(max(nframes, min_frames), max_frames), total_frames)
-nframes = floor_by_factor(nframes, FRAME_FACTOR)
-if not (FRAME_FACTOR <= nframes <= total_frames):
-    raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
-return nframes
+nframes = min(min(nframes, max_frames), total_frames)
+return int(nframes)
```
The refactoring of v1_smart_nframes has removed the validation that ensured the number of frames (nframes) is at least FRAME_FACTOR (which is 2). The new logic only guarantees nframes >= 1. This could introduce a regression if downstream code expects at least 2 frames. Additionally, the ValueError that was previously raised for out-of-bounds nframes has been removed, which might hide potential configuration issues from the user.
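If the original guarantees are worth keeping, the removed clamp and check could be restored along these lines (a sketch only; `floor_by_factor` is re-implemented here to make the snippet self-contained, and the parameter names follow the code above):

```python
FRAME_FACTOR = 2  # per the review comment, the module's frame-count factor

def floor_by_factor(n, factor):
    """Round n down to the nearest multiple of factor."""
    return int(n // factor) * factor

def clamp_nframes(nframes, min_frames, max_frames, total_frames):
    """Clamp to [min_frames, max_frames], align to FRAME_FACTOR, and fail
    loudly on out-of-range results instead of silently returning them."""
    nframes = min(min(max(nframes, min_frames), max_frames), total_frames)
    nframes = floor_by_factor(nframes, FRAME_FACTOR)
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(
            f"nframes should be in [{FRAME_FACTOR}, {total_frames}], got {nframes}."
        )
    return int(nframes)
```

This preserves the `nframes >= FRAME_FACTOR` invariant that downstream code may rely on, and surfaces misconfiguration as an error rather than a warning.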
```python
if ref_hidden_states_input is not None:
    ref_hidden_states_input = ref_hidden_states_input.to(latent_model_input.dtype)
else:
    latent_model_input = latents.to(self.transformer.dtype)
    prompt_embeds_model_input = prompt_embeds
    timestep_model_input = timestep
    ref_hidden_states_input = ref_hidden_states * 1.0 if ref_hidden_states is not None else None
    if ref_hidden_states_input is not None:
        ref_hidden_states_input = ref_hidden_states_input.to(latent_model_input.dtype)
```
The ref_hidden_states_input tensor is cast to latent_model_input.dtype in two separate places. To avoid code duplication and improve readability, you could perform this type casting once after the if/else block where ref_hidden_states_input is defined.
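The hoisting pattern being suggested, illustrated with plain arithmetic stand-ins (in the real code the duplicated step is the `.to(latent_model_input.dtype)` cast):

```python
def before(x, flag):
    if flag:
        y = x + 1
        y = y * 2  # duplicated post-processing in each branch
    else:
        y = x - 1
        y = y * 2  # duplicated post-processing in each branch
    return y

def after(x, flag):
    # Branch-specific work stays in the if/else...
    y = x + 1 if flag else x - 1
    # ...and the shared step (the dtype cast in the real code) runs once.
    y = y * 2
    return y
```

Both functions compute the same result; the second has a single site to update if the shared step ever changes.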
```python
# Streaming speech QA with interruption
all_wavs = []
all_text = ""
request_id = ""
output_audio_path = "test_stream.wav"
for data_type, data_content in ming.generate_stream(
    text="介绍一下杭州", output_type="speech", max_new_tokens=128
):
    if data_type == "text_data":
        text, usage = data_content
    elif data_type == "text_audio_data":
        tts_speech, text, meta_info, session_id, usage = data_content
        all_text += text
        all_wavs.append(tts_speech)
    if len(all_text) > 20:
        ming.generate_interrupt(request_id)
waveform = torch.cat(all_wavs, dim=-1)
sr = 44100
torchaudio.save(output_audio_path, waveform, sr)
print(f"request_id:{request_id},audio:{output_audio_path},text={all_text}")
assert os.path.exists(output_audio_path)
```
The example for streaming speech QA with interruption has a bug. The request_id variable is initialized as an empty string and is never updated within the loop. However, it's passed to ming.generate_interrupt(request_id). This will not work as intended.
To fix this, you should pass a unique msg_request_id to the generate_stream call and use that same ID for the interruption, as shown in the 'Request Interruption' example later in the README. The session_id returned by the stream should also be used correctly if it's intended for this purpose.
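A sketch of the fix, assuming `generate_stream` accepts a `msg_request_id` keyword as the 'Request Interruption' example suggests; a fake client stands in for the SDK here so the control flow can be exercised without the model:

```python
import uuid

class _FakeMing:
    """Minimal stand-in for the Ming SDK client (illustration only)."""
    def __init__(self):
        self.interrupted = None

    def generate_stream(self, text, output_type, msg_request_id, max_new_tokens):
        # Yields typed tuples the way the real stream does.
        for chunk in ("Hangzhou ", "is a city ", "in Zhejiang, ", "China."):
            yield "text_data", (chunk, {"tokens": 1})

    def generate_interrupt(self, request_id):
        self.interrupted = request_id

ming = _FakeMing()  # the real example would construct the Ming SDK client

# Key fix: create the request ID up front and pass it to generate_stream,
# so the very same ID is valid when calling generate_interrupt.
request_id = str(uuid.uuid4())
all_text = ""
for data_type, data_content in ming.generate_stream(
    text="介绍一下杭州", output_type="speech",
    msg_request_id=request_id, max_new_tokens=128,
):
    if data_type == "text_data":
        text, usage = data_content
        all_text += text
    if len(all_text) > 20:
        ming.generate_interrupt(request_id)  # interrupt with the known ID
        break
```

The parameter name `msg_request_id` is taken from the review's reference to the later README example; the README example should use whichever keyword the SDK actually defines.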
```python
self.saved_features = []

# [0, 1, 2, 3, 4, 5, 7, 9, 11, 14, 17, 20, 22, 24, 26, 27, 28, 29]:
if step in [0, 1, 2, 3, 4, 5, 7, 9, 12, 15, 18, 21, 23, 25, 26, 27, 28, 29]:
```
```python
except Exception as e:
    logger.error(f"Error in text producer: {e}")
    talker_input_queue.put(None, None)
```
Catching a generic Exception is too broad and can mask underlying issues, making debugging more difficult. It's better to catch more specific exceptions that you expect might occur in the producer thread. If a generic catch is necessary, consider logging the full traceback and re-raising the exception to avoid silently swallowing errors.
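One way to follow this advice (a sketch: the exception types to expect and the producer's real signature are assumptions, and the snippet also uses a single `put(None)` sentinel, since `put(None, None)` passes `None` as the `block` argument):

```python
import logging
import queue
import traceback

logger = logging.getLogger(__name__)
talker_input_queue: queue.Queue = queue.Queue()

def text_producer(generate):
    """Producer loop with narrower, louder error handling."""
    try:
        for token in generate():
            talker_input_queue.put(token)
    except (ValueError, RuntimeError) as e:  # errors we expect from generation
        logger.error("Error in text producer: %s\n%s", e, traceback.format_exc())
        talker_input_queue.put(None)  # one sentinel, not put(None, None)
        raise                         # re-raise instead of swallowing the failure
```

Unexpected exception types then propagate with a full traceback instead of being silently absorbed by the thread.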
```python
def logging_tasks(self, request_id, future):
    time.sleep(0.01)
    running_keys = []
    pendding_keys = []
```
```python
from setuptools import setup, find_packages

__version__ = "1.0.0"  #
requirement = open("ming_sdk/requirements.txt").readlines()
```
```python
package_data={"ming_sdk": fetch_installed_data("")},
description="Ming Multimodal sdk",
keywords="Ming Multimodal sdk",
packages=["ming_sdk"],
```
The packages argument is set to ["ming_sdk"], which is correct. However, using find_packages() from setuptools is a more robust and conventional way to automatically discover all packages within the project. This avoids manual updates if you add sub-packages in the future.
```diff
-    packages=["ming_sdk"],
+    packages=find_packages(),
```
```python
return_dict=return_dict,
return_assistant_tokens_mask=return_assistant_tokens_mask,
tokenizer_kwargs=tokenizer_kwargs,
**kwargs,
```
While adding **kwargs provides flexibility, it can also hide which specific parameters are being used by the underlying apply_chat_template call. It would be more explicit and maintainable to list the expected keyword arguments (like enable_thinking from the new template) in the function signature with default values.
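A sketch of the more explicit signature (the surrounding processor class is omitted, and this toy body only records its inputs; the real behavior of `enable_thinking` is defined by the new chat template):

```python
def apply_chat_template(
    conversation,
    return_dict=False,
    return_assistant_tokens_mask=False,
    tokenizer_kwargs=None,
    enable_thinking=False,  # previously hidden inside **kwargs
    **kwargs,               # still accepted for forward compatibility
):
    """Toy stand-in showing the signature shape, not the real processor."""
    options = {"enable_thinking": enable_thinking, **kwargs}
    return {"conversation": conversation, "return_dict": return_dict,
            "options": options}
```

Named parameters make the supported options discoverable in the signature and in editor tooltips, while the trailing `**kwargs` still forwards anything new.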
Add Ming-SDK, a tool that facilitates multimodal interaction with the Ming-Flash-2.0 model. For detailed usage instructions of Ming-SDK, please refer to the updated README.