Given the fast pace of change in the AI and generative AI market, it can be difficult to stay up to date with the latest news, technology, and announcements. With that in mind, I'm starting a new blog series focusing on what I think are the most important industry developments and, of course, my views on why they matter.
Enhanced code generation with CodeLlama 70B release from Meta
I recently spoke about the role of generative AI in application development as part of TechTarget and BrightTALK's Generative AI Summit. According to a study by TechTarget's Enterprise Strategy Group (ESG), one in five organizations surveyed expects to make significant investments in generative AI capabilities for application and software development in 2024. These organizations aim to accelerate software delivery and improve developer efficiency across multiple use cases, leading to faster and better code writing and documentation.
Reflecting this trend, Meta recently released CodeLlama 70B, an open source model built for code generation that is free to download. It was trained on 1 trillion code tokens, and its context window was significantly increased to 100,000 tokens.
This model can process and generate longer, more complex code in Python, C++, Java, PHP, and other popular languages based on natural language prompts or existing code snippets. What's more, Meta claims it can do it faster and more accurately than ever before. However, while Llama's performance has improved significantly, Meta is still chasing OpenAI's GPT-4.
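To make the prompt-driven workflow above concrete, here is a minimal sketch of formatting a natural language request for an instruction-tuned code model. The `[INST]`/`[/INST]` wrapper follows the Llama-2-style chat convention that Code Llama's instruct variants use; treat the exact template, and the helper name, as assumptions to verify against Meta's documentation rather than a confirmed API.

```python
# Sketch: wrapping a plain-English request in an instruct-style prompt.
# The [INST]...[/INST] template is the Llama-2-family chat convention;
# the exact format for a given Code Llama variant should be verified
# against Meta's model card before use.

def build_code_prompt(request: str, language: str = "Python") -> str:
    """Wrap a plain-English request in an instruct-style prompt string."""
    instruction = f"Write {language} code for the following task:\n{request}"
    return f"<s>[INST] {instruction} [/INST]"

prompt = build_code_prompt("Parse a CSV file and sum the 'amount' column.")
print(prompt)
```

In practice, the resulting string would be passed to a locally hosted model or a text-generation endpoint along with generation parameters (temperature, maximum tokens), which are omitted here.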
Release of Hermes training data from Nous Research highlights the continued importance of transparency
Another area that I find fascinating is the scrutiny of training data for some of the largest models in use today. This is where the increasingly important topic of indemnity comes into play. We'll talk more about this in a follow-up blog.
Following recent headlines regarding the New York Times lawsuit against OpenAI and Microsoft for copyright infringement, it is now important for large language model (LLM) authors and end users alike to understand indemnification provisions. And it's not just about the outputs and responses, but also about the data used to train the LLM.
A newsworthy moment occurred last week when applied research group Nous Research released the entire dataset, containing more than 1 million data points, used to train its OpenHermes 2.5 and Nous Hermes 2 models. By attributing virtually all of the data in the training set to sources within the open source ecosystem, Nous Research is setting a new standard for openness and transparency.
OpenAI faces further legal pressure from EU
Over the past year, different countries have restricted the use of ChatGPT for different reasons, with Italy in particular imposing and then lifting a ban. On January 29, Italy's data protection authority gave OpenAI 30 days to respond to complaints that ChatGPT violates the EU's General Data Protection Regulation.
The issue appears to center on OpenAI's processing of personal data to train its AI models. If violations are found, fines could reach up to €20 million, or up to 4% of the company's annual global revenue, but the bigger concern for OpenAI is that it may be required to change its data collection and processing practices in EU member states.
ESG research found that 95% of organizations have some type of active compliance guidelines related to data used in AI projects. But this level of backlash is almost expected when companies hide information about the exact data used to train proprietary and commercial models.
With further regulation expected across the global AI market, organizations need to be ready to act quickly, not only to prevent legal issues, but also to ensure the accuracy and fairness of the data used for AI training and insights. Ultimately, this will be critical to whether companies can demonstrate that they are developing AI in a responsible and trustworthy manner.
Big tech companies' huge investments in AI unicorns trigger FTC investigation
With ESG research highlighting the skills gap as a key challenge for organizations considering generative AI adoption, it's no wonder organizations are turning to the leaders at the forefront of innovation. Several organizations stand out as such leaders, including major cloud providers such as Microsoft, Google, and Amazon, and AI unicorns such as OpenAI and Anthropic.
But all five of these companies are currently being scrutinized as part of a Federal Trade Commission (FTC) investigation into recent investments, including Microsoft's $10 billion investment in OpenAI, as well as Amazon's $4 billion and Google's $300 million investments in Anthropic. And these are just the large-scale partnerships tied to the latest FTC investigation; there are many other examples of strategic partnerships between large technology companies and generative AI companies.
In my view, the core of this FTC investigation is the nature of these investment agreements and their impact on market share and competition. That said, recent breakthroughs in generative AI would not have been possible without these partnerships and investments, which have significantly accelerated innovation, even beyond the technology companies I specifically mentioned.
Voltron Data acquires real-time AI company Claypot AI
Claypot AI, Voltron Data's recent acquisition, works with both batch and streaming data, clearly aligning with Voltron's goals. Voltron recently announced Theseus, a composable, high-speed data processing engine built on GPUs, but until now the company has focused primarily on local and batch data.
The acquisition of Claypot will enable Voltron to offer streaming data analytics using the same open standards the company was founded on, including the integration of open source technologies such as Apache Arrow, Apache Parquet, and Ibis. Importantly, these two teams have been collaborating on building streaming data backends for some time. For customers, this development promises access to real-time AI capabilities in addition to several other core AI lifecycle technologies, including feature engineering enablement and MLOps.
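To make the batch-versus-streaming distinction behind this acquisition concrete, here is a toy sketch in plain Python (bearing no relation to Voltron's or Claypot's actual APIs) contrasting a batch aggregate, which needs all data up front, with a streaming aggregate that updates incrementally as records arrive:

```python
# Toy contrast between batch and streaming aggregation.
# Batch: compute over the full dataset at once.
# Streaming: update a running result as each record arrives,
# using constant memory regardless of stream length.

def batch_mean(values):
    """Batch style: requires the complete dataset before computing."""
    return sum(values) / len(values)

class StreamingMean:
    """Streaming style: maintains only a count and a running total."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # current running mean

data = [10.0, 20.0, 30.0, 40.0]
stream = StreamingMean()
running = [stream.update(v) for v in data]

print(batch_mean(data))  # 25.0
print(running[-1])       # 25.0 -- same answer, computed incrementally
```

Real streaming engines add windowing, late-arrival handling, and fault tolerance on top of this basic incremental-update idea, which is why pairing Claypot's streaming expertise with Voltron's batch-oriented GPU engine is a natural fit.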
Mike Leone is a Principal Analyst in TechTarget's Enterprise Strategy Group, covering data, analytics, and AI.
Enterprise Strategy Group is a division of TechTarget. The company's analysts have business relationships with technology vendors.