Open-source News

DDR5 Memory Channel Scaling Performance With AMD EPYC 9004 Series

Phoronix - Fri, 01/06/2023 - 20:30
In addition to the big performance uplift from AVX-512, up to 96 cores per socket, and other Zen 4 architectural improvements, also empowering the EPYC 9004 "Genoa" processors is the support for up to 12 channels of DDR5-4800 memory. In this article is a wide assortment of benchmarks looking at the AMD EPYC 9654 performance across varying numbers of populated DDR5 memory channels.

An MGLRU Performance Regression Fix Is On The Way Plus Another Optimization

Phoronix - Fri, 01/06/2023 - 18:47
While MGLRU is a nice performance win for the Linux kernel, now available as an opt-in for v6.1+ kernel builds, during my testing I did encounter a regression around SVT-AV1 video encode performance, and a fix is working its way toward mainline...

X.Org Server No Longer Allowing Byte-Swapped Clients By Default

Phoronix - Fri, 01/06/2023 - 18:34
Following the recent discussions around Fedora planning to disable byte swapped clients support for the X.Org Server in order to close another "large attack surface" with the aging X11 server codebase, the upstream X.Org Server has now dropped this support by default...

The Linux OS Originally Known As Lindows Is Out With Linspire 12 Alpha

Phoronix - Fri, 01/06/2023 - 18:21
Linspire, the Linux distribution whose roots go back two decades to when it originally started out as "Lindows", is preparing a new major release. The current incarnation of Linspire began five years ago when PC/OpenSystems acquired the Linspire and Freespire rights from Xandros. Linspire claims to be "the easiest desktop Linux" and is looking to improve things further with its v12 release...

Mold 1.9 Released With Support For More CPU Architectures

Phoronix - Fri, 01/06/2023 - 18:05
Mold, the high-performance linker alternative to GNU Gold and LLVM LLD, is out with another feature release...

Use time-series data to power your edge projects with open source tools

opensource.com - Fri, 01/06/2023 - 16:00

Data gathered as it changes over time is known as time-series data. Today, it has become a part of every industry and ecosystem. It is a large part of the growing IoT sector and will become a larger part of everyday people's lives. But time-series data and its requirements are hard to work with, because most tools are not purpose-built to handle it. In this article, I go into detail about those problems and how InfluxData has been working to solve them for the past 10 years.

InfluxData

InfluxData is an open source time-series database platform. You may know about the company through InfluxDB, but you may not have known that it specialized in time-series databases. This is significant, because when managing time-series data, you deal with two issues — storage lifecycle and queries.

When it comes to storage lifecycle, it's common for developers to initially collect and analyze highly detailed data. But developers want to store smaller, downsampled datasets that describe trends without taking up as much storage space.

When querying a database, you don't want to query your data based on IDs. You want to query based on time ranges. One of the most common things to do with time-series data is to summarize it over a large period of time. This kind of query is slow when storing data in a typical relational database that uses rows and columns to describe the relationships of different data points. A database designed to process time-series data can handle these queries exponentially faster. InfluxDB has its own built-in query language, Flux, built specifically for querying time-series data sets.

(Image by Zoe Steinkamp, CC BY-SA 4.0)

Data acquisition

Data acquisition and data manipulation come out of the box with some awesome tools. InfluxData has over 12 client libraries that allow you to write and query data in the coding language of your choice. These are great for custom use cases. The open source ingest agent, Telegraf, includes over 300 input and output plugins. If you're a developer, you can contribute your own plugin, as well.
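
For illustration, here is a minimal sketch of writing and querying a single point with the Python client library (influxdb-client); the URL, token, org, and bucket names below are placeholders you would substitute with your own:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details: substitute your own instance and credentials.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# Write a single time-stamped point.
write_api = client.write_api(write_options=SYNCHRONOUS)
point = Point("city_IoT").tag("source", "bicycle").field("counter", 3)
write_api.write(bucket="smartcity", record=point)

# Query it back with Flux over a recent time range.
flux = 'from(bucket: "smartcity") |> range(start: -1h) |> filter(fn: (r) => r._measurement == "city_IoT")'
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())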

InfluxDB can also accept a CSV upload for small historical data sets, as well as batch imports for large data sets.

import "math"
bicycles3 = from(bucket: "smartcity")
    |> range(start: 2021-03-01T00:00:00Z, stop: 2021-04-01T00:00:00Z)
    |> filter(fn: (r) => r._measurement == "city_IoT")
    |> filter(fn: (r) => r._field == "counter")
    |> filter(fn: (r) => r.source == "bicycle")
    |> filter(fn: (r) => r.neighborhood_id == "3")
    |> aggregateWindow(every: 1h, fn: mean, createEmpty:false)

bicycles4 = from(bucket: "smartcity")
    |> range(start: 2021-03-01T00:00:00Z, stop: 2021-04-01T00:00:00Z)
    |> filter(fn: (r) => r._measurement == "city_IoT")
    |> filter(fn: (r) => r._field == "counter")
    |> filter(fn: (r) => r.source == "bicycle")
    |> filter(fn: (r) => r.neighborhood_id == "4")
    |> aggregateWindow(every: 1h, fn: mean, createEmpty:false)

join(tables: {neighborhood_3: bicycles3, neighborhood_4: bicycles4}, on: ["_time"], method: "inner")
    |> keep(columns: ["_time", "_value_neighborhood_3","_value_neighborhood_4"])
    |> map(fn: (r) => ({
        r with
        difference_value : math.abs(x: (r._value_neighborhood_3 - r._value_neighborhood_4))
    }))

Flux

Flux is our internal querying language, built from the ground up to handle time-series data. It's also the underlying powerhouse for a few of our tools, including tasks, alerts, and notifications. To dissect the Flux query above, you need to define a few things. For starters, a "bucket" is what we call a database. You configure your buckets and then add your data stream into them. The query calls the smartcity bucket with a range covering one month (March 2021). You can get all the data from the bucket, but most users include a time range. That's the most basic Flux query you can do.

Next, I add filters, which narrow the data down to something more exact and manageable. For example, I filter for the count of bicycles in the neighborhood assigned the ID of 3. From there, I use aggregateWindow to get the mean for every hour, so I expect to receive a table with one row for every hour in the range. I do the exact same query for neighborhood 4 as well. Finally, I join the two tables and get the differences between bike usage in these two neighborhoods.

This is great if you want to know which hours are high-traffic hours. Obviously, this is just a small example of the power of Flux queries, but it gives a good look at some of the tools Flux comes with. Flux also has a large number of data analysis and statistics functions. For those, I suggest checking out the Flux documentation.

import "influxdata/influxdb/tasks"

option task = {name: "PB_downsample", every: 1h, offset: 10s}
from(bucket: "plantbuddy")
    |> range(start: tasks.lastSuccess(orTime: -task.every))
    |> filter(fn: (r) => r["_measurement"] == "sensor_data")
    |> aggregateWindow(every: 10m, fn: last, createEmpty: false)
    |> yield(name: "last")
    |> to(bucket: "downsampled")

Tasks

An InfluxDB task is a scheduled Flux script that takes a stream of input data and modifies or analyzes it in some way. It then stores the modified data in a new bucket or performs other actions. Storing a smaller data set into a new bucket is called "downsampling," and it's a core feature of the database, and a core part of the time-series data lifecycle.

You can see in the task example above that I've downsampled the data. I'm getting the last value for every 10-minute increment and storing it in the downsampled bucket. The original data set might have thousands of data points in those 10 minutes, but the downsampled bucket only gets one value per window. One thing to note is that I'm also using the tasks.lastSuccess() function in range. This tells InfluxDB to run the task from the last time it ran successfully, so if the task has failed for the past two hours, it can go back three hours to the last successful run and catch up. This is great for built-in error handling.

(Image by Zoe Steinkamp, CC BY-SA 4.0)

Checks and alerts

InfluxDB includes a checks-and-notifications (alerting) system. This system is very straightforward. You start with a check that periodically looks at the data for anomalies that you've defined, normally as thresholds. For example, any temperature value below 0°F gets assigned a status of CRITICAL, anything between 0°F and 32°F gets WARN, and anything above 32°F gets OK. From there, your check can run as often as you deem necessary. There is a recorded history of your checks and the current status of each. You are not required to set up a notification; you can just reference your alert history as needed.
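
To make the threshold idea concrete, here is a small illustrative Python sketch of that classification rule (this is just the logic described above, not InfluxDB code):

def classify(temp_f: float) -> str:
    """Illustrative status check mirroring the thresholds above (temperatures in degrees Fahrenheit)."""
    if temp_f < 0:
        return "CRITICAL"
    if temp_f < 32:
        return "WARN"
    return "OK"

print(classify(-5), classify(20), classify(50))  # CRITICAL WARN OK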

Many people choose to set up notifications. For that, you need to define a notification endpoint; for example, a chat application could make an HTTP call to receive your notifications. Then you define when you would like to receive them: you can have checks run every hour, for example, but send notifications only every 24 hours. You can have a notification respond to a change in status (say, WARN to CRITICAL), or fire whenever a value is CRITICAL, regardless of whether it changed from OK to WARN. This is a highly customizable system, and the Flux code it generates can also be edited.

(Image by Zoe Steinkamp, CC BY-SA 4.0)

Edge

To wrap up, I'd like to bring all the core features together, including a very special new feature that's recently been released. Edge-to-cloud is a very powerful feature that allows you to run open source InfluxDB at the edge and store your data locally in case of connectivity issues. When connectivity is restored, it streams the data to the InfluxData cloud platform.

This is significant for edge devices and important data where any loss of data is detrimental. You define a bucket to be replicated to the cloud, and that bucket gets a disk-backed queue to store the data locally. Then you define which cloud bucket the data should replicate into. The data is stored locally until the device reconnects to the cloud.
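
Conceptually, the disk-backed queue behaves like a store-and-forward buffer. The Python sketch below illustrates only that idea (send_to_cloud is a hypothetical stand-in, not the actual replication code):

import queue

class ReplicationBuffer:
    """Illustrative store-and-forward buffer: keep writes locally, flush when the cloud is reachable."""

    def __init__(self):
        self.pending = queue.Queue()

    def write(self, point, cloud_is_reachable: bool):
        self.pending.put(point)  # always persist locally first
        if cloud_is_reachable:
            self.flush()

    def flush(self):
        while not self.pending.empty():
            send_to_cloud(self.pending.get())

def send_to_cloud(point):
    print("replicated:", point)  # hypothetical upload call

buf = ReplicationBuffer()
buf.write({"moisture": 0.42}, cloud_is_reachable=False)  # buffered while offline
buf.write({"moisture": 0.40}, cloud_is_reachable=True)   # connection restored: both points replicate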

InfluxDB and the IoT Edge

Suppose you have a project where you want to monitor the health of household plants using IoT sensors attached to the plant. The project is set up using your laptop as the edge device. When your laptop is closed or otherwise off, it stores the data locally, then streams it to your cloud bucket when reconnected.

(Image by Zoe Steinkamp, CC BY-SA 4.0)

One thing to notice is that this setup downsamples the data on the local device before storing it in the replication bucket. Your plant's sensors provide a data point every second, but the edge task condenses the data to a one-minute average so there is less data to store. In the cloud account, you might add alerts and notifications that let you know when the plant's moisture is below a certain level and it needs to be watered. There could also be visuals you could use on a website to tell users about their plants' health.

Databases are the backbone of many applications. Working with time-stamped data in a time series database platform like InfluxDB saves developers time, and gives them access to a wide range of tools and services. The maintainers of InfluxDB love seeing what people are building within our open source community, so connect with us and share your projects and code with others!

An introduction to DocArray, an open source AI library

opensource.com - Fri, 01/06/2023 - 16:00

DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, and so on. It allows deep-learning engineers to efficiently process, embed, search, store, recommend, and transfer multimodal data with a Pythonic API. As of November 2022, DocArray is open source and hosted by the Linux Foundation AI & Data initiative, so that there's a neutral home for building and supporting an open AI and data community. This is the start of a new day for DocArray.

In the ten months since DocArray’s first release, its developers at Jina AI have seen more and more adoption and contributions from the open source community. Today, DocArray powers hundreds of multimodal AI applications.

Hosting an open source project with the Linux Foundation

Hosting a project with the Linux Foundation follows open governance, meaning there’s no one company or individual in control of a project. When maintainers of an open source project decide to host it at the Linux Foundation, they specifically transfer the project’s trademark ownership to the Linux Foundation.

In this article, I’ll review the history and future of DocArray. In particular, I’ll demonstrate some cool features that are already in development.

A brief history of DocArray

Jina AI introduced the concept of "DocArray" in Jina 0.8 in late 2020. It was the jina.types module, intended to complete neural search design patterns by clarifying low-level data representation in Jina. Rather than working with Protobuf directly, the new Document class offered a simpler and safer high-level API to represent multimodal data.

(Image by Jina AI, CC BY-SA 4.0)

Over time, we extended jina.types and moved beyond a simple Pythonic interface to Protobuf. We added DocumentArray to ease batch operations on multiple Documents. Then we brought in IO and pre-processing functions for different data modalities, like text, image, video, audio, and 3D meshes. The Executor class started to use DocumentArray for input and output. In Jina 2.0 (released in mid-2021) the design became stronger still. Document, Executor, and Flow became Jina's three fundamental concepts (a minimal sketch follows the list below):

• Document is the data IO in Jina
• Executor defines the logic of processing Documents
• Flow ties Executors together to accomplish a task.
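
For readers unfamiliar with Jina, a minimal sketch of these three concepts in action (using the Jina 3.x style API; the Executor and its logic here are made up purely for illustration) might look like this:

from docarray import Document, DocumentArray
from jina import Executor, Flow, requests

class UpperCaser(Executor):
    """Hypothetical Executor: defines the logic applied to Documents flowing through it."""

    @requests
    def upper(self, docs: DocumentArray, **kwargs):
        for doc in docs:
            doc.text = doc.text.upper()

f = Flow().add(uses=UpperCaser)  # the Flow ties Executors together
with f:
    result = f.post('/', inputs=DocumentArray([Document(text='hello docarray')]))
    print(result[0].text)  # HELLO DOCARRAY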

The community loved the new design, as it greatly improved the developer experience by hiding unnecessary complexity. This lets developers focus on the things that really matter.

(Image by Jina AI, CC BY-SA 4.0)


As jina.types grew, it became conceptually independent from Jina. While jina.types was more about building locally, the rest of Jina focused on service-ization. Trying to achieve two very different targets in one codebase created maintenance hurdles. On the one hand, jina.types had to evolve fast and keep adding features to meet the needs of the rapidly evolving AI community. On the other hand, Jina itself had to remain stable and robust as it served as infrastructure. The result? A slowdown in development.

We tackled this by decoupling jina.types from Jina in late 2021. This refactoring served as the foundation of the later DocArray. It was then that DocArray's mission crystallized for the team: to provide a data structure for AI engineers to easily represent, store, transmit, and embed multimodal data. DocArray focuses on local developer experience, optimized for fast prototyping. Jina scales things up and uplifts prototypes into services in production. With that in mind, Jina AI released DocArray 0.1 in parallel with Jina 3.0 in early 2022, independently as a new open source project.

We chose the name DocArray because we want to make something as fundamental and widely used as NumPy's ndarray. Today, DocArray is the entry point of many multimodal AI applications, like the popular DALLE-Flow and DiscoArt. DocArray developers introduced new and powerful features, such as dataclass and document store, to improve usability even more. DocArray has allied with open source partners like Weaviate, Qdrant, Redis, FastAPI, pydantic, and Jupyter for integration and, most importantly, to seek a common standard.

In DocArray 0.19 (released on Nov. 15), you can easily represent and process 3D mesh data.

(Image by Jina AI, CC BY-SA 4.0)

The future of DocArray

Donating DocArray to the Linux Foundation marks an important milestone where we share our commitment with the open source community openly, inclusively, and constructively.

The next release of DocArray focuses on four tasks:

• Representing: support Python idioms for representing complicated, nested multimodal data with ease.
• Embedding: provide smooth interfaces for mainstream deep learning models to embed data efficiently.
• Storing: support multiple vector databases for efficient persistence and approximate nearest neighbor retrieval.
• Transiting: allow fast (de)serialization and become a standard wire protocol on gRPC, HTTP, and WebSockets.

In the following sections, DocArray maintainers Sami Jaghouar and Johannes Messner give you a taste of the next release.

All-in-dataclass

In DocArray, dataclass is a high-level API for representing a multimodal document. It follows the design and idiom of the standard Python dataclass, letting users represent complicated multimodal documents intuitively and process them easily with DocArray's API. The new release makes dataclass a first-class citizen and refactors its old implementation by using pydantic V2.

How to use dataclass

Here's how to use the new dataclass. First, you should know that a Document is a pydantic model with a random ID and the Protobuf interface:

from docarray import Document

To create your own multimodal data type, you just need to subclass from Document:

from docarray import Document
from docarray.typing import Tensor
import numpy as np
class Banner(Document):
   alt_text: str
   image: Tensor

banner = Banner(alt_text='DocArray is amazing', image=np.zeros((3, 224, 224)))

Once you've defined a Banner, you can use it as a building block to represent more complicated data:

from typing import List

class BlogPost(Document):
   title: str
   excerpt: str
   banner: Banner
   tags: List[str]
   content: str

Adding an embedding field to BlogPost is easy. You can use the predefined Document models Text and Image, which come with the embedding field baked in:

from typing import Optional
from docarray.typing import Embedding
class Image(Document):
   src: str
   embedding: Optional[Embedding]
class Text(Document):
   content: str
   embedding: Optional[Embedding]

Then you can represent your BlogPost:

class Banner(Document):
   alt_text: str
   image: Image

class BlogPost(Document):
   title: Text
   excerpt: Text
   banner: Banner
   tags: List[str]
   content: Text

This gives your multimodal BlogPost four embedding representations: title, excerpt, content, and banner.
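
To illustrate, a hypothetical instance of this model, assuming the classes defined above behave like ordinary pydantic models, might be constructed like so:

post = BlogPost(
   title=Text(content='DocArray joins the Linux Foundation'),
   excerpt=Text(content='A new home for multimodal data.'),
   banner=Banner(alt_text='DocArray logo', image=Image(src='banner.png')),
   tags=['open-source', 'ai'],
   content=Text(content='DocArray is a library for multimodal data...'),
)

Each of the nested Text and Image documents can then carry its own embedding, which is what gives the post its four embedding representations.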

Milvus support

Milvus is an open source vector database, itself hosted under Linux Foundation AI & Data. It's highly flexible, reliable, and blazing fast, and supports adding, deleting, updating, and near real-time search of vectors on a trillion-byte scale. As the first step towards a more inclusive DocArray, developer Johannes Messner has been implementing Milvus integration.

As with other document stores, you can easily instantiate a DocumentArray with Milvus storage:

from docarray import DocumentArray
da = DocumentArray(storage='milvus', config={'n_dim': 10})

Here, config is the configuration for the new Milvus collection, and n_dim is a mandatory field that specifies the dimensionality of stored embeddings. The code below shows a minimum working example with a running Milvus server on localhost:

import numpy as np
from docarray import DocumentArray
N, D = 5, 128
da = DocumentArray.empty(
   N, storage='milvus', config={'n_dim': D, 'distance': 'IP'}
)  # init
with da:
   da.embeddings = np.random.random([N, D])
print(da.find(np.random.random(D), limit=10))

To access persisted data from another server, you need to specify collection_name, host, and port. This allows users to enjoy all the benefits that Milvus offers, through the familiar and unified API of DocArray.
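
A sketch of what that configuration might look like (the collection name, host, and port values here are placeholders):

from docarray import DocumentArray

da = DocumentArray(
   storage='milvus',
   config={
       'collection_name': 'my_collection',  # hypothetical existing collection
       'host': 'milvus.example.com',        # placeholder remote host
       'port': 19530,                       # default Milvus port
       'n_dim': 128,
   },
)
print(len(da))  # Documents already persisted in that collection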

Embracing open governance

The term "open governance" refers to the way a project is governed — that is, how decisions are made, how the project is structured, and who is responsible for what. In the context of open source software, "open governance" means the project is governed openly and transparently, and anyone is welcome to participate in that governance.

Open governance for DocArray has many benefits:

• DocArray is now democratically run, ensuring that everyone has a say.
• DocArray is now more accessible and inclusive, because anyone can participate in governance.
• DocArray will be of higher quality, because decisions are being made in a transparent and open way.

The development team is taking actions to embrace open governance, including:

• Creating a DocArray technical steering committee (TSC) to help guide the project.
• Opening up the development process to more input and feedback from the community.
• Making DocArray development more inclusive and welcoming to new contributors.

Join the project

If you're interested in open source AI, Python, or big data, then you're invited to follow along with the DocArray project as it develops. If you think you have something to contribute to it, then join the project. It's a growing community, and one that's open to everyone.

This article was originally published on the Jina AI blog and has been republished with permission.

Compiling The Linux Kernel With LLVM's Clang Matured In 2022

Phoronix - Fri, 01/06/2023 - 08:00
Over the past few years it's become possible to compile the mainline Linux kernel with LLVM/Clang compared to the long-standing dependence on using the GCC compiler. While it's been possible for 3+ years to use the mainline Linux kernel and mainline Clang for building a working x86_64 and AArch64 kernel, the process and support continues to mature...
