Difference between revisions of "Prismatic VLM REPL"

Latest revision as of 00:51, 21 June 2024

Prismatic VLM (https://github.com/TRI-ML/prismatic-vlms) is the project on which OpenVLA is based. The generate.py REPL script is also available in the OpenVLA repo, but it essentially uses Prismatic models. Note that Prismatic models generate natural language, whereas OpenVLA models were trained to generate robot actions (see https://github.com/openvla/openvla/issues/5).

Of note, the K-Scale OpenVLA adaptation by User:Paweł is at https://github.com/kscalelabs/openvla

Prismatic REPL Script Guide

Here are some suggestions for running the generate.py REPL script from the repo (you can find it in the scripts folder).

Prerequisites

Before running the script, ensure you have the following:

Python 3.8 or higher installed
NVIDIA GPU with CUDA support (optional but recommended for faster processing)
Hugging Face account and token for accessing Meta Llama models
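
The first two prerequisites can be checked with a short script. This is a minimal sketch using only the standard library. The function names are hypothetical, and treating an `nvidia-smi` binary on the PATH as a sign of an NVIDIA driver is an assumption: it does not prove that CUDA works with your PyTorch build.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the prerequisites


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def nvidia_driver_visible():
    """Heuristic (assumption): `nvidia-smi` on PATH suggests an NVIDIA driver is installed."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print(f"Python >= 3.8: {python_ok()}")
    print(f"nvidia-smi found: {nvidia_driver_visible()}")
```

If `nvidia-smi` is not found, the script may still run on CPU, since the GPU is listed as optional, but expect it to be slow.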

Setting Up the Environment

In addition to installing requirements-min.txt from the repo, you will probably also need to install rich, tensorflow_graphics, tensorflow-datasets, and dlimp.
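
To see which of these extra packages are still missing, something like the following may help. It is a sketch, not part of the repo: the helper name is hypothetical, and the import names are assumed to match the package names above, except that tensorflow-datasets is imported as `tensorflow_datasets`.

```python
import importlib.util

# Import names assumed for the extra packages named above.
EXTRA_MODULES = ["rich", "tensorflow_graphics", "tensorflow_datasets", "dlimp"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXTRA_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All extra modules found.")
```

The check only looks the modules up and does not import them, so it stays fast even when TensorFlow is installed.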

Set up Hugging Face token

You need a Hugging Face token to access certain models, and the script expects to find it in a .hf_token file. Create a file named `.hf_token` in the root directory of your project and add your Hugging Face token to it:

echo "your_hugging_face_token" > .hf_token
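
If you would rather create the file from Python, the sketch below does the same thing and also restricts the file permissions. The helper names are hypothetical, and how generate.py itself reads the file is not shown on this page. Stripping whitespace on read is a defensive assumption, since `echo` appends a trailing newline.

```python
import os
from pathlib import Path


def write_token(token, path=".hf_token"):
    """Write the token to `path` and make the file readable by the owner only."""
    token_path = Path(path)
    token_path.write_text(token.strip() + "\n")
    os.chmod(token_path, 0o600)  # best effort; has limited effect on Windows
    return token_path


def read_token(path=".hf_token"):
    """Read the token back, dropping surrounding whitespace and the trailing newline."""
    return Path(path).read_text().strip()
```

For example, `write_token("your_hugging_face_token")` followed by `read_token()` returns the token without the newline. Keep the file out of version control.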

Sample Images for generate.py REPL

You can get these by capturing frames or screenshotting rollout videos from

 https://openvla.github.io/

Make sure the images have an end effector in them.

(Image: Coke can2.png, can pickup task)
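
One possible way to capture frames from a downloaded rollout video is to call ffmpeg from Python. This is a sketch that assumes ffmpeg is installed and on the PATH. The function names, the file names, and the rate of one frame per second are all illustrative choices and do not come from the repo.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_dir, fps=1):
    """Build an ffmpeg command that saves `fps` frames per second as PNG files."""
    pattern = str(Path(out_dir) / "frame_%04d.png")
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", pattern]


def extract_frames(video_path, out_dir, fps=1):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, out_dir, fps), check=True)
```

For example, `extract_frames("rollout.mp4", "frames")` writes frame_0001.png, frame_0002.png, and so on into the frames folder. You still need to pick frames by hand in which the end effector is visible.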

Starting REPL mode

Next, run generate.py. The script starts by initializing the generation playground with the Prismatic model prism-dinosiglip+7b, which is downloaded from the Hugging Face Hub.

Once the model configuration is found, the model is loaded with the following components:

Vision Backbone: dinosiglip-vit-so-384px

Language Model (LLM) Backbone: llama2-7b-pure (this is where the Hugging Face token is needed)

Architecture Specifier: no-align+fused-gelu-mlp

Checkpoint Path: The model checkpoint is loaded from a specific path in the cache.

You should see the model loading output in your terminal (screenshot: Openvla1.png, "prismatic models").

After loading the model, the script enters REPL mode, which provides a default generation setup and waits for user input.

The REPL lets you interactively test generating outputs from the Prismatic model prism-dinosiglip+7b. At the REPL prompt, you can enter the following:

(i) to load a new local image by specifying its path
(p) to update the prompt template for generating outputs
(q) to quit the REPL
any other text, which is used as a prompt to generate a response based on the loaded image and the current prompt template

(Screenshot: Prismatic chat1.png, "prismatic chat")
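
The command handling described above can be sketched as a small dispatch function. This is not the code from generate.py. The names `ReplState`, `handle`, `echo_model`, and `run_repl` are hypothetical, the `{message}` template format is an assumption, and the model call is replaced by a stand-in that only echoes its inputs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReplState:
    image_path: Optional[str] = None    # set by the (i) command
    prompt_template: str = "{message}"  # hypothetical template format, set by (p)
    running: bool = True                # cleared by the (q) command


def handle(state: ReplState, line: str, read: Callable[[str], str],
           generate: Callable[[Optional[str], str], str]) -> Optional[str]:
    """Dispatch one line of REPL input; return generated text, if any."""
    command = line.strip()
    if command.lower() == "q":
        state.running = False
        return None
    if command.lower() == "i":
        state.image_path = read("Image path: ").strip()
        return None
    if command.lower() == "p":
        state.prompt_template = read("Prompt template: ")
        return None
    # Anything else is treated as a prompt for the currently loaded image.
    return generate(state.image_path, state.prompt_template.format(message=command))


def echo_model(image_path: Optional[str], prompt: str) -> str:
    """Stand-in for the real model call; it only echoes its inputs."""
    return f"[{image_path}] {prompt}"


def run_repl() -> None:
    """Interactive loop: read a line, dispatch it, print any output."""
    state = ReplState()
    while state.running:
        output = handle(state, input("> "), input, echo_model)
        if output is not None:
            print(output)
```

Calling `run_repl()` starts the loop. Passing the input and model functions in as arguments keeps `handle` easy to test without a terminal or a loaded model.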

