= Robotic Control via Embodied Chain-of-Thought Reasoning =

Embodied Chain-of-Thought Reasoning (ECoT) is a novel approach for training robotic policies. It trains a vision-language-action (VLA) model to generate reasoning steps in response to instructions and images before choosing a robot action, enabling better performance, interpretability, and generalization.

The codebase is built on top of OpenVLA. Refer to it for detailed documentation of the code and dependencies.
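
For a concrete sense of what the policy produces, the sketch below shows an abridged, illustrative reasoning chain of the kind ECoT generates before committing to an action. The tag names (TASK, PLAN, SUBTASK, MOVE, GRIPPER POSITION, ACTION) follow the examples in the paper and the "ASSISTANT: TASK:" prompt used in the Quickstart below; the wording and layout here are invented for illustration and are not real model output.

<code>
# Abridged, illustrative ECoT generation (not real model output): the policy
# first writes out its reasoning as plain text, then commits to an action.
example_generation = (
    "TASK: put the carrot on the plate. "
    "PLAN: locate the carrot, grasp it, move it over the plate, release. "
    "SUBTASK: move the gripper to the carrot. "
    "MOVE: move the arm right and down. "
    "GRIPPER POSITION: [115, 98] "
    "ACTION:"  # followed by action tokens, which predict_action decodes into a robot command
)
</code>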

== Quickstart ==

A Colab notebook is provided containing code for loading the ECoT policy and using it to generate reasoning and actions in response to an observation. Loading the model for inference is easy:

<code>
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the ECoT policy and its processor from HuggingFace.
device = "cuda"
path_to_hf = "Embodied-CoT/ecot-openvla-7b-bridge"
processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(path_to_hf, torch_dtype=torch.bfloat16).to(device)

# Build the prompt from the current image observation and language instruction.
image = <ROBOT IMAGE OBSERVATION HERE>
instruction = <YOUR INSTRUCTION HERE>
prompt = "A chat between a curious user and an artificial intelligence assistant. " + \
    "The assistant gives helpful, detailed, and polite answers to the user's questions. " + \
    f"USER: What action should the robot take to {instruction.lower()}? ASSISTANT: TASK:"

# predict_action generates the reasoning chain and decodes the final robot action.
inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)
action, generated_ids = vla.predict_action(**inputs, unnorm_key="bridge_orig", max_new_tokens=1024)
generated_text = processor.batch_decode(generated_ids)[0]
</code>

The standard model in torch.bfloat16 requires 16 GB of GPU memory, but using bitsandbytes and 4-bit quantization lowers memory usage to around 5 GB. See the Colab for more details.
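
As a sketch of that low-memory path, the snippet below loads the same checkpoint with 4-bit quantization through the standard transformers/bitsandbytes integration (BitsAndBytesConfig). This is an assumption about the setup rather than the exact configuration from the Colab, so check the notebook for the settings actually used there.

<code>
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

path_to_hf = "Embodied-CoT/ecot-openvla-7b-bridge"

# Store weights in 4-bit via bitsandbytes; computation still runs in bfloat16.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    path_to_hf,
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="auto",  # quantized weights are dispatched to the GPU automatically
)
</code>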

== Training and Evaluation ==

To train the models from scratch, use the following command:

<code>
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py \
  --vla.type "prism-dinosiglip-224px+mx-bridge" \
  --data_root_dir <path to training data root> \
  --run_root_dir <path to checkpoint saving directory> \
  --wandb_project <wandb project name> \
  --wandb_entity <wandb user name>
</code>

To evaluate the model on the WidowX robot:

<code>
python3 experiments/bridge/eval_model_in_bridge_env.py \
  --model.type prism-dinosiglip-224px+7b \
  --pretrained_checkpoint <path to checkpoint> \
  --host_ip <robot interface IP> \
  --port <robot interface port>
</code>

== Pretrained models ==

Two ECoT models and a dataset of reasonings are available on the HuggingFace page:

* '''ecot-openvla-7b-bridge''': The main model used for most of the experiments. It was trained for 80k steps on the Bridge dataset annotated with reasonings.
* '''ecot-openvla-7b-oxe''': A policy initially trained on the Open-X-Embodiment dataset actions, then fine-tuned for another 20k steps on a mixture of OXE action-only data and the Bridge reasonings.
* '''embodied_features_bridge''': A dataset of the embodied features and reasonings collected for Bridge demonstrations.

=== Explicit Notes on Model Licensing & Commercial Use ===

While all code in this repository is released under an MIT License, the pretrained models may inherit restrictions from the underlying base models. Specifically, both models are derived from Llama-2, and are subject to the Llama Community License.

== Installation ==

Refer to the original OpenVLA repository for detailed installation instructions.

== Repository Structure ==

High-level overview of repository/project file-tree:

* '''prismatic''': Package source; provides core utilities for model loading, training, data preprocessing, etc.
* '''experiments''': Code for evaluating the policies on a WidowX robot.
* '''vla-scripts/''': Core scripts for training, fine-tuning, and deploying VLAs.
* '''LICENSE''': All code is made available under the MIT License.
* '''Makefile''': Top-level Makefile (by default, supports lint checking and auto-fixing); extend as needed.
* '''pyproject.toml''': Full project configuration details (including dependencies), as well as tool configurations.
* '''README.md''': Top-level project README.

== Citation ==

If the code or models are useful in your work, please cite the paper:

<code>
@article{Zawalski24-ecot,
    title={Robotic Control via Embodied Chain-of-Thought Reasoning},
    author={Michał Zawalski and William Chen and Karl Pertsch and Oier Mees and Chelsea Finn and Sergey Levine},
    journal={arXiv preprint arXiv:2407.08693},
    year={2024}
}
</code>