After about a month of preparation following the initial mainnet launch, the Chia cryptocurrency (XCH) has officially started trading — which means it may be preparing to suck up all of the best SSDs the way Ethereum (see how to mine Ethereum) has been gobbling up the best graphics cards. Early Chia calculators suggested an estimated starting price of $20 per XCH. That was way off, but with the initial fervor and hype subsiding, we’re ready to look at where things stand and where they might stabilize.
To recap, Chia is a novel approach to cryptocurrencies, ditching the Proof of Work hashing used by most coins (e.g., Bitcoin, Ethereum, Litecoin, Dogecoin, and others) and instead opting for a new Proof of Space and Time algorithm. Using storage capacity helps reduce the potential power footprint, obviously at the cost of storage. And let’s be clear: The amount of storage space (aka netspace) already used by the Chia network is astonishing. It passed 1 EiB (exbibyte, or 2^60 bytes) of storage on April 28, and just a few days later it’s approaching the 2 EiB mark. Where will it stop? That’s the $21 billion question.
All of that space goes to storing plots of Chia, which are basically massive 101.4GiB Bingo cards. Each online plot has an equal chance, based on the total netspace, of ‘winning’ the block solution. This occurs at a rate of approximately 32 blocks per 10 minutes, with 2 XCH as the reward per block. Right now, assuming every Chia plot was stored on a 10TB HDD (which obviously isn’t accurate, but roll with it for a moment), that would require about 200,000 HDDs worth of Chia farms.
Assuming 5W per HDD, since they’re just sitting idle for the most part, that’s potentially 1 MW of power use. That might sound like a lot, and it is — about 8.8 GWh per year — but it pales in comparison to the amount of power going into Bitcoin and Ethereum. Ethereum, as an example, currently uses an estimated 41.3 TWh per year of power because it relies primarily on the best mining GPUs, while Bitcoin uses 109.7 TWh per year. That’s around 4,700 and 12,500 times more power than Chia at present, respectively. Of course, Ethereum and Bitcoin are also far more valuable than Chia at current exchange rates, and Chia has a long way to go to prove itself a viable cryptocoin.
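If you want to sanity-check those numbers, the back-of-the-envelope math is simple enough to run yourself. The Python sketch below uses the same simplifying assumptions as above (roughly 200,000 idle 10TB HDDs drawing 5W each, plus the published annual consumption estimates for Ethereum and Bitcoin):

# Rough Chia power math using the assumptions above: ~200,000 idle 10TB HDDs at ~5W each
drives = 200_000
watts_per_hdd = 5
chia_mw = drives * watts_per_hdd / 1e6          # ~1 MW of continuous draw
chia_gwh_per_year = chia_mw * 8760 / 1000       # ~8.8 GWh per year
eth_twh, btc_twh = 41.3, 109.7                  # estimated annual network consumption
print(f"Chia: ~{chia_gwh_per_year:.1f} GWh per year")
print(f"Ethereum: ~{eth_twh * 1000 / chia_gwh_per_year:,.0f}x Chia's power")
print(f"Bitcoin: ~{btc_twh * 1000 / chia_gwh_per_year:,.0f}x Chia's power")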
Back to the launch, though. Only a few cryptocurrency exchanges have picked up XCH trading so far, and none of them are what we would call major exchanges. Considering how many things have gone wrong in the past (like the Turkish exchange where the founder appears to have walked off with $2 billion in Bitcoin), discretion is definitely the best approach. Initially, according to Coinmarketcap, Gate.io accounted for around 65% of transactions, MXC.com was around 34.5%, and Bibox made up the remaining 0.5%. Since then, MXC and Gate.io swapped places, with MXC now sitting at 64% of all transactions.
By way of reference, Gate.io only accounts for around 0.21% of all Bitcoin transactions, and MXC doesn’t even show up on Coinmarketcap’s list of the top 500 BTC exchange pairs. So, we’re talking about small-time trading right now, on riskier platforms, with a total trading volume of around $27 million in the first day. That might sound like a lot, but it’s only a fraction of Bitcoin’s $60 billion or so in daily trade volume.
Chia started at an initial trading price of nearly $1,600 per XCH, peaked in early trading at around $1,800, and has been on a steady downward slope since then. At present, the price seems to have mostly flattened out (at least temporarily) at around $700. It could certainly end up going a lot lower, however, so we wouldn’t recommend betting the farm on Chia, but even at $100 per XCH a lot of miners/crypto-farmers are likely to jump on the bandwagon.
As with many cryptocoins, Chia is searching for equilibrium right now. 10TB of storage dedicated to Chia plots would be enough for a farm of 100 plots and should in theory account for 0.0005% of the netspace. That would mean about 0.046 XCH per day of potential farming, except you’re flying solo (proper Chia pools don’t exist yet), so it would take on average 43 days to farm a block — and that’s assuming netspace doesn’t continue to increase, which it will. But if you could bring in a steady stream of 0.04 XCH per day, even if we lowball things with a value of $100, that’s $4-$5 per day, from a 10TB HDD that only costs about $250. Scale that up to ten drives and you’d be looking at $45 per day, albeit with returns trending downward over time.
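Here’s a quick Python sketch of that math. The plot size, block rate, and block reward are the figures discussed above; the 10TB drive and $100 price are the same hypotheticals, and the output shifts daily as netspace grows, so treat these as ballpark numbers:

# Expected solo-farming returns for a single 10TB drive at ~2 EiB of netspace
plot_bytes = 101.4 * 2**30                       # one k=32 plot
drive_bytes = 10e12                              # 10TB HDD
netspace_bytes = 2 * 2**60                       # ~2 EiB total netspace
plots = int(drive_bytes // plot_bytes)           # roughly 90-100 plots per drive
daily_xch = 4608 * 2                             # ~32 blocks per 10 minutes, 2 XCH each
share = plots * plot_bytes / netspace_bytes
xch_per_day = share * daily_xch
print(f"{plots} plots = {share:.5%} of netspace")
print(f"~{xch_per_day:.3f} XCH per day, or ~${xch_per_day * 100:.2f} per day at a lowball $100 per XCH")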
GPU miners have paid a lot more than that for similar returns, and the power and complexity of running lots of GPUs (or ASICs) ends up being far higher than running a Chia farm. In fact, the recommended approach to Chia farming is to get the plots set up using a high-end PC, and then connect all the storage to a Raspberry Pi afterwards for low-power farming. You could run around 50 10TB HDDs for the same amount of power as a single RTX 3080 mining Ethereum.
It’s important to note that it takes a decent amount of time to get a Chia farm up and running. If you have a server with a 64-core EPYC processor, 256GB of RAM, and at least 16TB of fast SSD storage, you could potentially create up to 64 plots at a time, at a rate of around six (give or take) hours per group of plots. That’s enough to create 256 plots per day, filling over 2.5 10TB HDDs with data. For a more typical PC, with an 8-core CPU (e.g., Ryzen 7 5800X or Core i9-11900K), 32GB of RAM, and an enterprise SSD with at least 2.4TB of storage, doing eight concurrent plots should be feasible. The higher clocks on consumer CPUs probably mean you could do a group of plots in four hours, which means 48 plots per day occupying about half of a 10TB HDD. That’s still a relatively fast ramp to a bunch of drives running a Chia farm, though.
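The throughput math for those two setups works out as follows, assuming the parallel-plot counts and rough batch times given above:

# Plot-creation throughput for the two example machines described above
plot_tb = 101.4 * 2**30 / 1e12                   # one plot is roughly 0.11 TB
machines = [("64-core EPYC server", 64, 6),      # (name, parallel plots, hours per batch)
            ("8-core desktop", 8, 4)]
for name, parallel_plots, batch_hours in machines:
    plots_per_day = parallel_plots * (24 / batch_hours)
    print(f"{name}: ~{plots_per_day:.0f} plots/day, ~{plots_per_day * plot_tb:.1f} TB of farm added per day")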
In either case, the potential returns even with a price of $100 per XCH amount to hundreds of dollars per month. Obviously, that’s way too high of a return rate, so things will continue to change. Keep in mind that where a GPU can cost $15-$20 in power per month (depending on the price of electricity), a hard drive running 24/7 will only cost $0.35. So what’s a reasonable rate of return for filling up a hard drive or SSD and letting it sit, farming Chia? If we target $20 per month for a $250 10TB HDD, then either Chia’s netspace needs to balloon to around 60EiB, or the price needs to drop to around $16 per XCH — or more likely some combination of more netspace and lower prices.
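To see where that equilibrium might land, you can invert the math: pick a target monthly return per drive and solve for the XCH price a given netspace would require. This sketch reuses the same rough assumptions (a 10TB drive of 101.4GiB plots, 4,608 blocks and 9,216 XCH per day), so the outputs are approximations rather than predictions:

# Solve for the XCH price that yields a target monthly return from one 10TB drive
def price_for_target(target_usd_per_month, netspace_eib, drive_tb=10, plot_gib=101.4):
    plot_bytes = plot_gib * 2**30
    plots = (drive_tb * 1e12) // plot_bytes
    share = plots * plot_bytes / (netspace_eib * 2**60)
    xch_per_month = share * 4608 * 2 * 30        # ~30 days of block rewards
    return target_usd_per_month / xch_per_month
print(f"~${price_for_target(20, 2):.0f} per XCH sustains $20/month at ~2 EiB of netspace")
print(f"~${price_for_target(20, 60):.0f} per XCH sustains $20/month at ~60 EiB of netspace")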
In the meantime, don’t be surprised if storage prices shoot up. It was already starting to happen, but like the GPU and other component shortages, it might be set to get a lot worse.
We’ve all been there before—you want to play Tetris with your friends but a 20 mile Game Boy link cable is impractical. That’s where Stacksmashing’s incredibly clever Raspberry Pi Pico project comes into play.
The best Raspberry Pi projects are ones that bring us together, especially right now. Stacksmashing is using a Raspberry Pi Pico to act as a link cable, capable of connecting two or more friends online. That means you and your buddies can play Tetris against each other in real-time on actual Game Boys from all over the world!
Developing the project involved reverse-engineering the data from an actual link cable. The Raspberry Pi Pico interprets this data and can handle online Game Boy communication for the game of Tetris via a desktop client and server app. This tool is at the center of the project. All of the players connect through this locally hosted hub. In a demo, Stacksmashing managed to get three Game Boys playing Tetris competitively at the same time!
In order to connect a Game Boy to the server, a custom PCB was created. It’s compatible with Game Boys, Game Boy Colors, Game Boy Pockets, Game Boy Advances and even Game Boy Advance SPs. The PCB is soldered to a Pico which communicates with the computer over USB.
You can buy this custom PCB on Gumroad and explore the code used in the desktop server and client apps on GitHub.
The Epic Games v. Apple trial started on Monday, and if you wanted to follow along and listen to Epic CEO Tim Sweeney talk about the “metaverse,” your options were limited. In theory, there’s public access to the trial, like most court proceedings, but since the courthouse is still closed for lockdown, the only access was through the court’s teleconference line that was briefly overrun by screaming teens.
But for anyone looking for more user-friendly options, there’s good news. A surprising, small community of streamers has decided to rebroadcast the trial on streaming platforms — places built for the people who play Fortnite rather than the antitrust policy wonks in the courtroom. For yesterday’s proceeding, I found a handful of YouTube channels and streamers rebroadcasting the hearing online, including Geoff Keighley (gaming’s Ryan Seacrest) on the Game Awards YouTube channel. Keighley’s YouTube stream sat at around 1,000 viewers throughout Monday’s events, featuring an active side chat filled with Fortnite fans and foes negging the day’s witnesses.
Technically, you’re not supposed to do this. The court’s website explicitly tells users that “any recording, copying, or rebroadcasting of a remote court hearing is absolutely prohibited.” Electronic recording devices are often banned from public sessions for the same reason.
But because you’re breaking the court’s rules and not copyright law, streaming the trial is much less likely to result in an account strike than a sports or television live stream. And while conventional media outlets could have their press credentials stripped for defying the ban, most streamers are far enough outside that system that they don’t care.
A separate Fortnite streamer, Golden (112,000 subscribers), was also following the court hearing, providing commentary for his followers interested in the day’s events. In order to avoid ticking off the court, he muted the trial’s audio and provided a link to Keighley’s stream for viewers looking to follow along themselves. He also listed his Discord server in the video’s bio and had three audio rooms dedicated to re-streaming Keighley’s audio. The Discord’s general chat was a mix of armchair antitrust lawyers and others complaining about Sweeney’s bad mic setup.
“Bruh this audio,” one person wrote, responding to Sweeney’s mumbling.
It’s hard to say how many streamers will be active for today’s proceedings — but if you’re hoping to follow along, searching YouTube and Twitch for rebroadcasts might turn up more options than you think.
TSMC produces chips for AMD, but it also now uses AMD’s processors to control the equipment that it uses to make chips for AMD (and other clients too). Sounds like a weird circulation of silicon, but that’s exactly what happens behind the scenes at the world’s largest third-party foundry.
There are hundreds of companies that use AMD EPYC-based machines for their important, sometimes business-critical workloads. Yet when it comes to mission-critical work, Intel Xeon (and even Intel Itanium and mainframes) still rules the world. Luckily for AMD, things have begun to change, and TSMC has announced that it is now using EPYC-based servers for its mission-critical fab control operations.
“For automation with the machinery inside our fab, each machine needs to have one x86 server to control the operation speed and provision of water, electricity, and gas, or power consumption,” said Simon Wang, Director of Infrastructure and Communication Services Division at TSMC.
“These machines are very costly. They might cost billions of dollars, but the servers that control them are much cheaper. I need to make sure that we have high availability in case one rack is down, then we can use another rack to support the machine. With a standard building block, I can generate about 1,000 virtual machines, which can control 1,000 fab tools in our cleanroom. This will mean a huge cost saving without sacrificing failover redundancy or reliability.”
TSMC started to use AMD EPYC machines quite some time ago for its general data center workloads, such as compute, storage, and networking. AMD’s 64-core EPYC processors feature 128 PCIe lanes and support up to 4TB of memory, two crucial features for servers used to run virtual machines. But while the infrastructure that supports 50,000 of TSMC’s employees globally is very complex and important (some would call it business-critical), it isn’t as important as TSMC’s servers that control fab tools.
Fab tools cost tens or hundreds of millions of dollars and process wafers carrying hundreds of chips that could be used to build products worth tens of thousands of dollars. Each production tool uses one x86 server that controls its operating speed as well as provisions water, electricity, and gas, or power. Sometimes hardware fails, so TSMC runs its workloads in such a way that one server can quickly replace the failed one (naturally, TSMC does not disclose which operating systems and applications it runs at its fabs).
At present TSMC uses HPE’s DL325 G10 platform running AMD EPYC 7702P processors with 64 cores (at 2.0 GHz ~ 3.35 GHz) in datacenters. It also uses servers based on 24-core EPYC 7F72s featuring a 3.20 GHz frequency for its R&D operations. As for machines used in TSMC’s fabs, the foundry keeps their specifications secret.
It is noteworthy that AMD’s data center products are used not only to produce chips, but also to develop them. AMD’s own Radeon Technologies Group uses EPYC processors to design GPUs.
Newly published images suggest that Intel’s forthcoming Sapphire Rapids CPU might feature roughly 72 to 80 cores, considerably more than initially thought.
Hardware blogger YuuKi_AnS, known for various leaks, has published pictures of what is claimed to be Intel’s Sapphire Rapids chiplets. This time around he removed the dies from the substrate and exposed their flip side. Surprisingly, this reveals a 4×5 grid of similar elements, which are believed to be Sapphire Rapids’ CPU cores. With each chiplet carrying 20 cores (or core candidates, more precisely), higher-end Sapphire Rapids processors could have as many as 80 cores, whereas lower-end chips would of course feature fewer cores.
Previously it was reported that Intel’s Sapphire Rapids will come with up to 56 Golden Cove cores, based on a slide that presumably came from Intel. That 56-core Sapphire Rapids CPU was claimed to have a TDP of 350W.
The disassembled Intel Sapphire Rapids sample is exactly the same CPU that was first pictured in February, more than three months ago. In other words, it is not a new sample, but the very unit that was obtained from an Intel partner either early in 2021 or late in 2020.
According to market rumors, Intel supplied its Sapphire Rapids samples with 28 enabled cores, and many believed that the company activated only half of the cores that its chiplet physically had (i.e., 7 out of 14). As it transpires, Intel’s Sapphire Rapids features 20 cores per chiplet, which theoretically allows Intel to build SPR processors with up to 80 cores. Meanwhile, 56-core CPUs will have six redundant cores per chiplet, which is way too many. By contrast, two redundant cores per chiplet would enable 72-core CPUs.
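The core-count arithmetic behind that reasoning is easy to follow. Assuming, per the leak, four chiplets per package and 20 physical core sites per chiplet, the spare-core math for the rumored configurations looks like this:

# Spare-core math for possible Sapphire Rapids configurations
# (assumes four chiplets per package and 20 core sites per chiplet, per the leak)
core_sites_per_chiplet = 20
chiplets = 4
for total_cores in (56, 72, 80):
    enabled_per_chiplet = total_cores // chiplets
    spares = core_sites_per_chiplet - enabled_per_chiplet
    print(f"{total_cores}-core SKU: {enabled_per_chiplet} active and {spares} spare cores per chiplet")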
Keep in mind that Intel has never confirmed the number of cores it plans for its Sapphire Rapids processors, and plans could always change. So take all this server silicon speculation with a few multi-core grains of salt.
I wish I’d had this to test with Samsung’s Odyssey G9
Microsoft has something of a history of neglecting PC gaming, but it’s trying to change that in a big way — by promising its flagship Halo Infinite will feel like a native PC game when it arrives later this year. We’ve known for many months that it wouldn’t be the Xbox Series X’s killer app, but Microsoft’s trying to make PC gamers feel like first-class citizens too, with features as forward-looking as support for 32:9 super-ultrawide monitors like the Samsung Odyssey G9 I reviewed late last year.
This morning, we learned the game would support ultrawide monitors, in addition to triple-keybinds, advanced graphics options, and both crossplay and cross-progression between Xbox and Windows PCs. But this evening, the Halo Waypoint blog went way deeper, revealing what Infinite will look like at 32:9 and an array of other PC-gamer-friendly details like being able to adjust your field of view up to 120 degrees — and the ability to host your own LAN multiplayer server!
In my Samsung Odyssey G9 review, I bemoaned how even the games that do support 32:9 typically look wildly stretched out at the edges, and I provided over a dozen examples of titles that don’t properly adjust the shape and curvature of the window they open into the 3D game world. But Halo Infinite PC development lead Mike Romero says the game’s designed to support arbitrary window sizes, and can fit its HUD, menus, and even in-game cutscenes into the wider aspect ratios.
“There’s dozens of people across the studio that have had to put dedicated effort into supporting something like ultrawide throughout the entirety of the game, and I’m very excited to say I think we’ll have some of the best ultrawide support I’ve ever seen in a game,” boasts Romero.
Looking at these Halo Infinite images at 32:9, it’s not immediately clear to me that Microsoft has solved the 32:9 issue — the hill on the right of the image below, for instance, still seems a little bit skewed and warped.
But it is clear that you’ll see a lot more of the game world at once this way, if you’re one of the few who’ve ascended to an ultrawide monitor — and have a PC powerful enough to drive it, of course.
Here’s a short list of all the PC-esque perks Microsoft is promising:
LAN play, hosting a local multiplayer server on PC that you can join from both PC and Xbox
Crossplay, restricting ranked matches to input type rather than console vs. PC, with server-side anti-cheat
Adjustable FOV (up to 120 degrees) on both PC and console
Mouse and keyboard support on both PC and console
Triple keyboard and mouse bindings
Visual quality settings up to ultra presets on PC, with individual settings for texture quality, depth of field, anti-aliasing etc.
High refresh rate options
21:9, 32:9 “and beyond” ultrawide monitor support on PC
Minimum and maximum framerate settings on PC
Fixed and dynamic resolution scaling options on PC
Optional borderless fullscreen on PC
FPS and ping overlay on PC
Out-of-game multiplayer invites let you join games through Xbox Live, Discord and Steam
As my colleague Tom Warren notes, there’s still more to learn, like whether the game will support GPU-dependent features on PC like Nvidia’s framerate-enhancing DLSS, ray tracing, and more.
Last week, a lot of PC gamers ran into issues after updating Windows 10. The update seemed to diminish gaming performance and introduced V-Sync errors and BSOD crashes. If you’ve been experiencing these issues, then we have some good news for you: a fix is now rolling out to put everything back in order.
These issues cropped up in two different Windows 10 updates – KB5001330 (mandatory security patch) and KB5000842 (optional update). Microsoft later acknowledged the issue officially, updating its support page to say: “A small subset of users have reported lower than expected performance in games after installing this update. Most users affected by this issue are running games full screen or borderless windowed modes and using two or more monitors”.
Now as reported by Windows Latest, we know that a fix has begun rolling out. Fortunately, the fix was applied server-side, meaning users don’t need to download and install another full Windows 10 update to benefit from the fixes.
The fix itself automatically disables the new code that caused these issues. Microsoft will do some additional bug fixing before enabling the code again in a future update.
KitGuru Says: Were any of you impacted by the recent Windows 10 issues? Is everything back to normal now, or are you still encountering problems?
Yesterday marked the 36th anniversary of the first power-on of an Arm processor. Today, the company announced the deep-dive details of its Neoverse V1 and N2 platforms that will power the future of its data center processor designs and span up to a whopping 192 cores and 350W TDP.
Naturally, all of this becomes much more interesting given Nvidia’s pending $40 billion Arm acquisition, but the company didn’t share further details during our briefings. Instead, we were given a deep dive look at the technology roadmap that Nvidia CEO Jensen Huang says makes the company such an enticing target.
Arm claims its new, more focused Neoverse platforms come with impressive performance and efficiency gains. The Neoverse V1 platform is the first Arm core to support Scalable Vector Extensions (SVE), bringing up to 50% more performance for HPC and ML workloads. Additionally, the company says that its Neoverse N2 platform, its first IP to support the newly-announced Armv9 extensions, like SVE2 and Memory Tagging, delivers up to 40% more performance in diverse workloads.
Additionally, the company shared further details about its Neoverse Coherent Mesh Network (CMN-700) that will tie together the latest V1 and N2 designs with intelligent high-bandwidth low-latency interfaces to other platform additives, such as DDR, HBM, and various accelerator technologies, using a combination of both industry-standard protocols, like CCIX and CXL, and Arm IP. This new mesh design serves as the backbone for the next generation of Arm processors based on both single-die and multi-chip designs.
If Arm’s performance projections pan out, the Neoverse V1 and N2 platforms could provide the company with a much faster rate of adoption in multiple applications spanning the data center to the edge, putting even more pressure on x86 stalwarts Intel and AMD, especially considering the full-featured connectivity options available for both single- and multi-die designs. Let’s start with the Arm Neoverse roadmap and objectives, then dive into the details of the new chip IP.
Arm Neoverse Platform Roadmap
Arm’s roadmap remains unchanged from the version it shared last year, but it does help map out the steady cadence of improvements we’ll see over the next few years.
Arm’s server ambitions took flight with the Cortex-A72 in 2015, which was equivalent in performance and performance-per-watt to a traditional thread on a standard competing server architecture.
Arm says its current-gen Neoverse N1 core, which powers AWS Graviton 2 chips and Ampere’s Altra, equals or exceeds a ‘traditional’ (read: x86) SMT thread. Additionally, Arm says that, given N1’s energy efficiency, one N1 core can replace three x86 threads while using the same amount of power, providing an overall 40% better price-vs-performance ratio. Arm chalks much of this design’s success up to the Coherent Mesh Network 600 (CMN-600) that enables linear performance scaling as core counts increase.
Arm has revised both its core architecture and the mesh for the new Neoverse V1 and N2 platforms that we’ll cover today. Now they support up to 192 cores and 350W TDPs. Arm says the N2 core will take the uncontested lead over an SMT thread on competing chips and offers superior performance-per-watt.
Additionally, the company says that the Neoverse V1 core will offer the same performance as competing cores, marking the first time the company has achieved parity with two threads running on an SMT-equipped core. Both chips utilize Arm’s new CMN-700 mesh that enables either single-die or multi-chip solutions, offering customers plenty of options, particularly when deployed with accelerators.
As one would expect, Arm’s Neoverse N2 and V1 target hyperscale and cloud, HPC, 5G, and the infrastructure edge markets. Customers include Tencent, Oracle Cloud with Ampere, Alibaba, and AWS with Graviton 2 (which is available in 70 out of 77 AWS regions). Arm also has two exascale-class supercomputer deployments planned with Neoverse V1 chips: SiPearl “Rhea” and the ETRI K-AB21.
Overall, Arm claims that its Neoverse N2 and V1 platforms will offer best-in-class compute, performance-per-watt, and scalability over competing x86 server designs.
Arm Neoverse V1 Platform ‘Zeus’
Arm’s existing Neoverse N1 platform scales from the cloud to the edge, encompassing everything from high-end servers to power-constrained edge devices. The next-gen Neoverse N2 platform preserves that scalability across a spate of usages. In contrast, Arm designed the Neoverse V1 ‘Zeus’ platform specifically to introduce a new performance tier as it looks to more fully penetrate HPC and machine learning (ML) applications.
The V1 platform comes with a wider and deeper architecture that supports Scalable Vector Extensions (SVE), a type of SIMD instruction. The V1’s SVE implementation runs across two lanes with a 256b vector width (2x256b), and the chip also supports the bFloat16 data type to provide enhanced SIMD parallelism.
With the same (ISO) process, Arm claims up to 1.5x IPC increase over the previous-gen N1 and a 70% to 100% improvement to power efficiency (varies by workload). Given the same L1 and L2 cache sizes, the V1 core is 70% larger than the N1 core.
The larger core makes sense, as the V-series is optimized for maximum performance at the cost of both power and area, while the N2 platform steps in as the design that’s optimized for power-per-watt and performance-per-area.
Per-core performance is the primary objective for the V1, as it helps to minimize the performance penalties for GPUs and accelerators that often end up waiting on thread-bound workloads, not to mention minimizing software licensing costs.
Arm also tuned the design to provide exceptional memory bandwidth, which impacts performance scalability, and next-gen interfaces, like PCIe 5.0 and CXL, provide I/O flexibility (much more on that in the mesh section). The company also focused on performance efficiency (a balance of power and performance).
Finally, Arm lists technical sovereignty as a key focus point. This means that Arm customers can own their own supply chain and build their entire SoC in-country, which has become increasingly important for key applications (particularly defense) amid heightened global trade tensions.
The Neoverse V1 represents Arm’s highest-performance core yet, and much of that comes through a ‘wider’ design ethos. The front end has an 8-wide fetch, 5-8 wide decode/rename unit, and a 15-wide issue into the back end of the pipeline (the execution units).
As you can see on the right, the chip supports HBM, DDR5, and custom accelerators. It can also scale out to multi-die and multi-socket designs. The flexible I/O options include the PCIe 5 interface and CCIX and CXL interconnects. We’ll cover Arm’s mesh interconnect design a bit later in the article.
Additionally, Arm claims that, relative to the N1 platform, SVE contributes to a 2x increase in floating point performance, 1.8x increase in vectorized workloads, and 4x improvement in machine learning.
One of V1’s biggest changes comes as the option to use either the 7nm or 5nm process, while the prior-gen N1 platform was limited to 7nm only. Arm also made a host of microarchitecture improvements spanning the front end, core, and back end to provide big speedups relative to prior-gen Arm chips, added support for SVE, and made accommodations to promote enhanced scalability.
Here’s a bullet list of the biggest changes to the architecture. You can also find additional details in the slides above.
Front End:
A net 90% reduction in branch mispredicts (for BTB misses) and a 50% reduction in front-end stalls
V1 branch predictor decoupled from instruction fetch, so the prefetcher can run ahead and prefetch instructions into the instruction cache
Widened branch prediction bandwidth (to 2x32b per cycle) to enable faster run-ahead
Increased capacity of the Dual-level BTB (Branch Target Buffers) to capture more branches with larger instruction footprints and to lower the taken branch latency, improved branch accuracy to reduce mispredicts
Enhanced ability to redirect hard-to-predict branches earlier in the pipeline, at fetch time, for faster branch recovery, improving both performance and power
Mid-Core:
Net increase of 25% in integer performance
Micro-Op (MOP) Cache: L0 decoded instruction cache optimizes the performance of smaller kernels in the microarchitecture, 2x increase in fetch and dispatch bandwidth over N1, lower-latency decode pipeline by removing one stage
Added more instruction fusion capability, improving performance and power efficiency for the most commonly-used instruction pairs
OoO (Out of Order) window increased by 2x to enhance parallelism; also increased integer execution bandwidth with a second branch execution unit and a fourth ALU
SIMD and FP Units: Added a new SVE implementation — 2x256b operations per cycle. Doubled raw execute capability from 2x128b pipelines in N1 to 4x128b in V1, for a 4x improvement in ML performance (slide 10)
Back End:
45% increase in streaming bandwidth, achieved by increasing load/store address bandwidth by 50% and adding a third load data address generation unit (AGU, a 50% increase)
To improve SIMD/floating point and integer execution, added a third load data pipeline and improved load bandwidth for integer and vector; doubled store bandwidth and split scheduling into two pipes
Increased load/store buffer window sizes and MMU capacity to allow for a larger number of cached translations
Reduced L2 cache latencies to improve single-threaded performance (slide 12)
This diagram shows the overall pipeline depth (left to right) and bandwidth (top to bottom), highlighting the impressive parallelism of the design.
Arm also instituted new power management and low-latency tools to extend beyond the typical capabilities of Dynamic Voltage Frequency Scaling (DVFS). These include the Max Power Mitigation Mechanism (MPMM) that provides a tunable power management system that allows customers to run high core-count processors at the highest possible frequencies, and Dispatch Throttling (DT), which reduces power during certain workloads with high IPC, like vectorized work (much like we see with Intel reducing frequency during AVX workloads).
At the end of the day, it’s all about Power, Performance, and Area (PPA), and the projections Arm shared here are the same figures cited above: up to 1.5x more IPC than the previous-gen N1, a 70% to 100% improvement in power efficiency depending on workload, and a core that’s roughly 70% larger at the same L1 and L2 cache sizes.
The Neoverse V1 supports Armv8.4, but the chip also borrows some features from future v8.5 and v8.6 revisions, as shown above.
Arm also added several features to manage system scalability, particularly as it pertains to partitioning shared resources and reducing contention, as you can see in the slides above.
Arm’s Scalable Vector Extensions (SVE) are a big draw of the new architecture. Firstly, Arm doubled compute bandwidth to 2x256b with SVE and provides backward support for Neon at 4x128b.
The key here is that SVE is vector length agnostic. Most vector ISAs have a fixed number of bits in the vector unit, but SVE lets the hardware set the vector length in bits; in software, the vectors have no fixed length. This simplifies programming and enhances the portability of binary code between architectures that support different bit widths — the instructions automatically scale to fully utilize the available vector bandwidth (for instance, 128b or 256b).
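A rough way to picture vector-length agnosticism is with plain pseudocode. The Python below is only a conceptual illustration, not actual SVE code: the loop never hard-codes a vector width, it just asks the hardware how many lanes it has and lets predication cover the tail, so the same "binary" produces identical results on 128-bit and 256-bit vector units.

# Conceptual illustration of a vector-length-agnostic loop (not real SVE code)
def vla_add(a, b, hardware_vector_bits):
    lanes = hardware_vector_bits // 32           # 32-bit elements per vector register
    out, i = [0.0] * len(a), 0
    while i < len(a):
        active = min(lanes, len(a) - i)          # predicate masks off the partial tail
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += active
    return out

data = list(range(10))
assert vla_add(data, data, 128) == vla_add(data, data, 256)   # same result, different widths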
Arm shared information on several fine-grained instructions for the SVE instructions, but much of those details are beyond the scope of this article. Arm also shared some simulated V1 and N2 benchmarks with SVE, but bear in mind that these are vendor-provided and merely simulations.
Arm Neoverse N2 Platform ‘Perseus’
Here we can see the slide deck for the N2 Perseus platform, with the key goals being a focus on scale-out implementations. Hence, the company optimized the design for performance-per-power (watt) and performance-per-area, along with a healthier dose of cores and scalability. As with the previous-gen N1 platform, this design can scale from the cloud to the edge.
Neoverse N2 has a newer core than the V1 chips, but the company isn’t sharing many details yet. However, we do know that N2 is the first Arm platform to support Armv9 and SVE2, which is the second generation of the SVE instructions we covered above.
Arm claims a 40% increase in single-threaded performance over N1, but within the same power and area efficiency envelope. Most of the details about N2 mirror those we covered with V1 above, but we included the slides above for more details.
Arm provided the above benchmarks, and as with all vendor-provided benchmarks, you should take them with a grain of salt. We have also included the test notes at the end of the album for further perusal of the test configurations.
Arm’s SPEC CPU 2017 single-core tests show a solid progression from N1 to N2, and then a higher jump in performance with the V1 platform. The company also provided a range of comparisons against the Intel Xeon 8268 and an unspecified 40-core Ice Lake Xeon system, and the EPYC Rome 7742 and EPYC Milan 7763.
Coherent Mesh Network (CMN-700)
Arm allows its partners to adjust core counts and cache sizes, use different types of memory, such as DDR5 and HBM, and select various interfaces, like PCIe 5.0, CXL, and CCIX, all of which requires a very flexible underlying design methodology. Add in the fact that Neoverse can span from the cloud and edge to 5G, and the interconnect also has to be able to span a full spectrum of power points and compute requirements. That’s where the Coherent Mesh Network 700 (CMN-700) steps in.
Arm focuses on security through compliance and standards, Arm open-source software, and ARM IP and architecture, all rolled under the SystemReady umbrella that serves as the underpinning of the Neoverse platform architecture.
Arm provides customers with reference designs based on its own internal work, with the designs pre-qualified in emulated benchmarks and workload analysis. Arm also provides a virtual model for software development too.
Customers can then take the reference design, choose between core types (like V-, N- or E-Series) and alter core counts, core frequency targets, cache hierarchy, memory (DDR5, HBM, Flash, Storage Class Memory, etc.), and I/O accommodations, among other factors. Customers also dial in parameters around the system-level cache that can be shared among accelerators.
There’s also support for multi-chip integration. This hangs off the coherent mesh network and provides plumbing for I/O connectivity options and multi-chip communication accommodations through interfaces like PCIe, CXL, CCIX, etc.
The V-Series CPUs address the growth of heterogeneous workloads by providing enough bandwidth for accelerators, support for disaggregated designs, and multi-chip architectures that help offset the slowing of Moore’s Law.
These types of designs help address the fact that the power budget per SoC (and thus thermals) is increasing, and also allow scaling beyond the reticle limits of a single SoC.
Additionally, I/O interfaces aren’t scaling well to smaller nodes, so many chipmakers (like AMD) are keeping PHYs on older nodes. That requires robust chip-to-chip connectivity options.
Here we can see the gen-on-gen comparison with the current CMN-600 interface found on the N1 chips. The CMN-700 mesh interface supports four times more cores and system-level cache per die, 2.2x more nodes (cross points) per die, 2.5x memory device ports (like DRAM, HBM) per die, and 8x the number of CCIX device ports per die (up to 32), all of which supplies intense scalability.
Arm improved cross-sectional bandwidth by 3X, which is important to provide enough bandwidth for scalability of core counts, scaling out with bandwidth-hungry GPUs, and faster memories, like DDR5 and HBM (the design accommodates 40 memory controllers for either/or DDR and HBM). Arm also has options for double mesh channels for increased bandwidth. Additionally, a hot spot reroute feature helps avoid areas of contention on the fabric.
The AMBA Coherent Hub Interface (CHI) serves as the high-performance interconnect for the SoC that connects processors and memory controllers. Arm improved the CHI design and added intelligent heuristics to detect and control congestion, combine operations to reduce transactions, and conduct data-less writes, all of which help reduce traffic on the mesh. These approaches also help with multi-chip scaling.
Memory partitioning and monitoring (MPAM) helps reduce the impact of noisy neighbors on system-level cache and isolates VMs to keep them from hogging system level cache (SLC). Arm also extends this software-controlled system to the memory controller as well. All this helps to manage shared resources and reduce contention. The CPU, accelerator, and PCIe interfaces all have to work together as well, so the design applies the same traffic management techniques between those units, too.
The mesh supports multi-chip designs through CXL or CCIX interfaces, and here we see a few of the use cases. CCIX is typically used inside the box or between the chips, be that heterogenous packages, chiplets, or multi-socket. In contrast, CXL steps in for memory expansion or pools of memory shared by multiple hosts. It’s also used for coherent accelerators like GPUs, NPUs, and SmartNICs, etc.
Slide 14 shows an example of a current connection topology — PCIe connects to the DPU (Data Plane Unit – SmartNic), which then provides the interconnection to the compute accelerator node. This allows multiple worker nodes to connect to shared resources.
Slide 15 shows us the next logical expansion of this approach — adding disaggregated memory pools that are shared between worker nodes. Unfortunately, as shown in slide 16, this creates plenty of bottlenecks and introduces other issues, such as spanning the home nodes and system-level cache across multiple dies. Arm has an answer for that, though.
Addressing those bottlenecks requires a re-thinking of the current approaches to sharing resources among worker nodes. Arm designed a multi-protocol gateway with a new AMBA CXS connection to reduce latency. This connection can transport CCIX 2.0 and CXL 2.0 protocols much faster than conventional interconnections. This system also provides the option of using a Flit data link layer that is optimized for the ultimate in low-latency connectivity.
This new design can be tailored for either socket-to-socket or multi-die compute SoCs. As you can see to the left on Slide 17, this multi-protocol gateway can be used either with or without a PCIe PHY. Removing the PCIe PHY creates an optimized die-to-die gateway for lower latency for critical die-to-die connections.
Arm has also devised a new Super Home Node concept to accommodate multi-chip designs. This implementation allows composing the system differently based on whether or not it is a homogenous design (direct connections between dies) or heterogeneous (compute and accelerator chiplets) connected to an I/O hub. The latter design is becoming more attractive because I/O doesn’t scale well to smaller nodes, so using older nodes can save quite a bit of investment and help reduce design complexity.
Thoughts
Arm’s plans for a 30%+ gen-on-gen IPC growth rate stretch into the next three iterations of its existing platforms (V1, N2, Poseidon) and will conceivably continue into the future. We haven’t seen gen-on-gen gains in that range from Intel in recent history, and while AMD notched large gains with the first two Zen iterations, as we’ve seen with the EPYC Milan chips, it might not be able to execute such large generational leaps in the future.
If Arm’s projections play out in the real world, that puts the company not only on an intercept course with x86 (it’s arguably already there in some aspects), but on a path to performance superiority.
Wrapping in the amazingly well-thought-out coherent mesh design makes these designs all the more formidable, especially in light of the ongoing shift to offloading key workloads to compute accelerators of various flavors. Additionally, bringing complex designs, like chiplet, multi-die, and hub-and-spoke architectures, all under one umbrella of pre-qualified reference designs could help spur a hastened migration to Arm architectures, at least for the cloud players. The attraction of licensable interconnects that democratize these complex interfaces is yet another arrow in Arm’s quiver.
Perhaps one of the most surprising tidbits of info that Arm shared in its presentations was one of the smallest — a third-party firm has measured that more than half of AWS’s newly-deployed instances run on Graviton 2 processors. Additionally, Graviton 2-powered instances are now available in 70 out of the 77 AWS regions. It’s natural to assume that those instances will soon have a newer N2 or V1 architecture under the hood.
This type of uptake, and the economies of scale and other savings AWS enjoys from using its own processors, will force other cloud giants to adapt with their own designs in kind, perhaps touching off the type of battle for superiority that can change the entire shape of the data center for years to come. There simply isn’t another vendor more well-positioned to compete in a world where the hyperscalers and cloud giants battle it out with custom silicon than Arm.
The recently released Ubuntu 21.04 is the latest version of the popular Linux distribution, and with this release we see Wayland arrive as the default display server. But it seems that those wishing to upgrade from a previous release, for example 20.04 or 20.10, are unable to. According to OMG! Ubuntu!, there is a bug which is preventing users from upgrading to the latest release.
The problem seems to be in shim, the bootloader which handles the secure boot process for the OS. Users running the wrong version of their EFI – an early one – can see their PC fail to boot after the upgrade. A new version of shim is on the way to fix the issue, but users who are sure their hardware is new enough to sidestep the problem can manually force an upgrade at their own risk from the command line.
“Due to the severity of the issue we shouldn’t be encouraging people to upgrade at this point in time,” wrote Canonical software engineer Brian Murray in a post to the Ubuntu Developer mailing list. “After we have a new version of shim signed will make it available in Ubuntu 21.04 and then enable upgrades.”
The exact nature of the hardware likely to fail is still unclear. We reached out to Canonical software engineer Dave Jones on Twitter, who suggested modern machines would be unaffected but older machines such as a ThinkPad 420 from 2011 and a MacBook Air from 2012 were affected by the bug.
Codenamed “Hirsute Hippo”, Ubuntu 21.04 brings support for the Wayland display server, with Xorg still available for those that need it. Native Active Directory integration and a performance-optimized, certified Microsoft SQL Server are new features for this release.
Washington, DC’s police department has confirmed its servers have been breached after hackers began leaking its data online, The New York Times reports. In a statement, the department confirmed it was aware of “unauthorized access on our server” and said it was working with the FBI to investigate the incident. The hacked data appears to include details on arrests and persons of interest.
The attack is believed to be the work of Babuk, a group known for its ransomware attacks. BleepingComputer reports that the gang has already released screenshots of the 250GB of data it’s allegedly stolen. One of the files is claimed to relate to arrests made following the January Capitol riots. The group warns it will start leaking information about police informants to criminal gangs if the police department doesn’t contact it within three days.
Washington’s police force, which is called the Metropolitan Police Department, is the third police department to be targeted in the last two months, according to the NYT, following attacks by separate groups against departments in Presque Isle, Maine and Azusa, California. The old software and systems used by many police forces are believed to make them more vulnerable to such attacks.
The targeting of police departments is believed to be part of a wider trend of attacks targeting government bodies. Twenty-six agencies are believed to have been hit by ransomware in this year alone, with 16 of them seeing their data released online, according to Emsisoft ransomware analyst Brett Callow, Sky News notes. The Justice Department reports that the average ransom demand has grown to over $100,000 as the attacks surged during the pandemic.
The Biden administration is attempting to improve the USA’s cybersecurity defenses, with an executive order expected soon. The Justice Department also recently formed a task force to help defend against ransomware attacks, The Wall Street Journal reports. “By any measure, 2020 was the worst year ever when it comes to ransomware and related extortion events,” acting Deputy Attorney General John Carlin, who’s overseeing the task force, told the WSJ. “And if we don’t break the back of this cycle, a problem that’s already bad is going to get worse.”
If you’re like me, you spend a little too much time on a few websites, particularly social media sites. Try as you might to avoid them, you find yourself unconsciously typing Facebook, Reddit, or something else into your address bar. There are a ton of browser extensions, desktop apps, and router rules you can configure to help, but I wanted something more physical.
This project uses light switches and an at-home DNS server running on a Raspberry Pi Zero to turn on or off access to specific websites. If I want access to a site I have blocked, I need to physically get up and turn off the switch — something that’s slightly more difficult than turning off a browser extension.
What You’ll Need To Make A Website Off Switch
Raspberry Pi Zero W, SD Card, and Power Supply
Two single-pole light switches
A PVC double-gang switch box
A plastic 2-toggle switch plate
Four male-female jumper wires
Phillips and flathead screwdrivers
How to Build a Website Off Switch With a Raspberry Pi
1. Set up your Raspberry Pi Zero W and connect to it remotely. If you don’t know how, see our tutorial on how to set up a headless Raspberry Pi.
2. Update your system and install git if it’s not already installed.
3. Clone the project’s internet_kill_switch repository onto your Pi.
4. Run the installation command from within the newly cloned directory. This will take care of installing all base dependencies, installing the dnsmasq tool, and ensuring it’s running.
cd internet_kill_switch
make install
5. Ensure dnsmasq is running by entering the following command. To exit, use Ctrl + C. You should see status “Running” in green.
sudo systemctl status dnsmasq
# Use Ctrl + C to exit
6. Try running the software to ensure everything works accordingly. You should see a few log messages with no errors. Exit the software with Ctrl + C.
make run
# Use Ctrl + C to exit
7. Wire the male end of each jumper cable to the switches. You should be able to tuck the pin under the screw and tighten it until it’s snug. These switches are designed for high AC voltage, but will work just the same for our project.
8. Attach the female end of the jumper cables to the Raspberry Pi. Pin order doesn’t matter, as long as one end of the switch goes to a GPIO pin, and the other goes to a ground pin. The default code expects GPIO (BCM) pins 18 and 24, though this is configurable.
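If you’re curious what the software does with those pins, here’s a minimal sketch of how two switches on BCM pins 18 and 24 can be read with the gpiozero library. This is purely illustrative and isn’t the project’s actual code:

# Illustrative only -- not the project's actual code. Reads two toggle switches
# wired between BCM pins 18/24 and ground, using gpiozero's internal pull-ups.
from gpiozero import Button
from signal import pause

social_switch = Button(18)   # closing the switch pulls the pin low, reading as "pressed"
video_switch = Button(24)

def report():
    print("social media blocked:", social_switch.is_pressed)
    print("video sites blocked:", video_switch.is_pressed)

for switch in (social_switch, video_switch):
    switch.when_pressed = report
    switch.when_released = report

report()
pause()                      # keep the script alive so it reacts to switch flips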
9. Tuck the Raspberry Pi into the PVC double-gang switch box, and affix it to the back with hot glue.
10. Run the power cable through the switch box inlet, and connect it to the Raspberry Pi.
11. Attach both switches to the switch box using the screws they came with.
12. Place the switch plate on top of the switches, and attach it with the screws it came with.
13. Label which switch belongs to which site. Keep track of which switch is attached to which pin.
14. Open the src/toggle_switches.py file for editing.
cd src/
nano toggle_switches.py
15. Modify toggle_switches.py to list the websites you want to block for each switch. You can add or remove switches by modifying the SWITCHES list, and adding new objects of the class SwitchConfig. The accepted parameters are the BCM pin number, followed by a list of sites to block.
# Example: The switch connected to BCM pin 18 blocks social media, and the switch connected to BCM pin 24 blocks video websites
SWITCHES = [
SwitchConfig(18, ['facebook.com', 'instagram.com']),
SwitchConfig(24, ['netflix.com', 'youtube.com', 'vimeo.com']),
]
16. Run the application to test it, and make note of the IP address it prints out when starting – that’s the IP address of your Pi.
cd ~/internet_kill_switch
make run
# IP Address is 10.0.0.25
17. Configure your DNS servers on your computer with that IP address. I’m configuring mine in System Preferences > Network of my MacBook, but you can also configure it at the router level.
18. Test the switches on the machine that now uses your Pi as a DNS server – in my case, my MacBook. With the switches in the off position, you should be able to access websites freely. Once you flip the switches to the on position, after a short delay, you should receive a network error when trying to visit that website.
19. To confirm, use the following command on your computer, substituting google.com with the site you’ve chosen to block:
# When the switch is off
dig +short google.com
172.217.164.238
# When the switch is on
dig +short google.com
127.0.0.1
20. Edit your /etc/rc.local file to include the following line, so this software runs when your Pi reboots.
sudo nano /etc/rc.local
# Add this line before the very last line
cd /home/pi/internet_kill_switch/src && sudo ../env/bin/python app.py &
21. Mount your switches. If you would like to put this switchbox on the wall, you’ll have to find a place to mount it that’s close to an outlet so you can power the Pi. Alternatively, you can just leave the switchbox on a table or shelf somewhere.
Enjoy, and if ever you’re not able to load websites, try removing the Pi’s IP address from your DNS servers – there’s likely a problem.
Intel’s new CEO Pat Gelsinger helmed his first full earnings call for the company yesterday, echoing other industry leaders in saying that the ongoing industry-wide chip shortages could last several more years. Intel also reported that it set a quarterly record for the most notebook PC chips sold in its history, but also reported a sudden slump in data center sales that found revenue dropping 20%, a record for the segment, as the number of units shipped and average selling prices both declined dramatically. Intel also posted its lowest profitability for its server segment in recent history, which surely is exacerbated by AMD’s continuing share gains and Intel’s resultant price cuts.
Intel’s first-quarter 2021 results were strong overall; the company raked in $18.6B in revenue, beating its January guidance by $1.1B (and analyst estimates). However, the impressive quarterly revenue is tempered by the fact that gross margins dropped to 58.4%, a 6.1 ppt decline year-over-year (YoY).
Chips shortages are top of mind, and Gelsinger said he expects the industry-wide chip shortages to last for several more years. A shortage of substrate materials and chip packaging capacity has hamstrung the industry, and Gelsinger said that Intel is bringing some of its chip packaging back in-house to improve its substrate supply. That new capacity comes online in Q2 and will “increase the availability of millions of units in 2021.”
Intel’s Client Computing Group (CCG), which produces both notebook and desktop PC chips, posted record notebook PC sales that were up 54% YoY. However, average selling prices (ASPs) dropped a surprising 20% YoY, which Intel chalked up to selling more lower-end devices, like low-cost consumer systems and education machines (Chromebooks). Competitive pricing pressure from AMD’s Ryzen 5000 Mobile processors also surely comes into play here.
Meanwhile, desktop PC volumes dropped 4% YoY while average selling prices dropped 5%, a continuation of an ongoing trend that’s exacerbated by tough competition from AMD’s Ryzen 5000 processors. All told, that led to CCG revenue being up 8% YoY. Intel’s consumer chips now account for 59% of its revenue.
Intel’s Data Center Group (DCG) results were far less impressive. The unit’s $5.6B in revenue was down 20% YoY, and Intel also shipped 13% fewer units than last year. Intel chalks this up to cloud inventory digestion, meaning that companies that ordered large numbers of chips in the past are still deploying that inventory. This is the second quarter in a row Intel has cited this as a reason for lowered revenue.
Intel formally launched its Ice Lake Xeon processors earlier this month, and it isn’t unheard of for large customers to pause purchases in the months before large product launches that bring big performance and efficiency gains. A complex mix of other factors could also contribute, like cloud service providers moving to their own chip designs and possible continued market share gains by AMD. Still, Intel expects its server chip sales to rebound in the second half of the year. We’ll learn more when AMD releases its results, and from the market share reports that will arrive in a few weeks.
Intel’s average selling prices for its server chips also dropped 14%. Intel cites an increase in sales of lower-cost networking SoCs as a contributor, but the company has also cut pricing drastically to compete with AMD’s EPYC processors, which ultimately has an impact.
DCG operating margins weighed in at 23%, a record low for a segment that typically runs in the 40% range, with Intel citing the impact of its 10nm Xeon ramp as a contributor. It’s also noteworthy that 10nm is less profitable than Intel’s 14nm process, so moving from 14nm Cascade Lake to 10nm Ice Lake will further impact margins.
Intel is moving forward with its plans to invest $20 billion this year as it expands its manufacturing capacity, with a good portion of that dedicated to Intel Foundry Services (IFS), which will make chips for other companies, much like we see with other third-party foundries like TSMC. Gelsinger says Intel is engaging with 50+ potential customers already.
Gelsinger also noted that Intel has onboarded 2,000 new engineers this year and expects to bring on “several thousand” more later in the year. However, he didn’t provide a frame of reference as to how that compares to Intel’s normal hiring rate, which is an important distinction in an engineering-heavy company with over 110,000 employees.
Intel expects to ramp up its investments and noted that it would reduce its stock buybacks as it plows more money into capital spending. Intel guides for a 57% gross margin for the second quarter on $17.8B in revenue and $1.05 in EPS.
The latest Ubuntu release is upon us. Ubuntu 21.04 “Hirsute Hippo” sees many small updates that will be supported for the next nine months. This may not be an LTS (Long Term Support) release but it does offer a few hints as to the direction in which Ubuntu will be heading. You can start the download now while reading this article, if you’re so inclined.
Ubuntu 21.04’s biggest change is the switch from the Xorg display server to Wayland, a welcome move given the age of Xorg.
“Wayland, from a high-level view, is about making the compositor the central process and the X server the optional add-on you can activate for legacy X applications,” the Ubuntu Wiki entry for Wayland reads.
In a press release, Canonical stated that “Ubuntu 21.04 uses Wayland by default, a significant leap forward in security. Firefox, OBS Studio and many applications built with Electron and Flutter take advantage of Wayland automatically, for smoother graphics and better fractional scaling.”
What else has changed? In Ubuntu 21.04, your home directory is now truly private. In previous releases, users could read each other’s home directories, including important files and personalized settings; that has been locked down for Ubuntu 21.04. This release also sees a new dark theme, Yaru, which includes new icons, and it brings the latest release of Gnome Shell, 3.38.5. Sadly, that’s not Gnome 40 with its lush new interface and curved edges, but you can of course install that yourself.
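If you want to see the home directory change for yourself, here’s a minimal Python sketch (not an official Ubuntu tool) that checks whether a home directory is still world-readable. It assumes the change amounts to dropping read/execute access for “other” users from the directory’s permission bits.

```python
# Minimal sketch: check whether a home directory is private or world-readable.
# Assumption: Ubuntu 21.04's change removes read/execute access for "other" users.
import os
import pwd
import stat

def home_is_private(username: str) -> bool:
    """Return True if 'other' users have no read/execute access to the home dir."""
    home = pwd.getpwnam(username).pw_dir
    mode = stat.S_IMODE(os.stat(home).st_mode)
    return (mode & (stat.S_IROTH | stat.S_IXOTH)) == 0

if __name__ == "__main__":
    user = os.environ.get("USER", "root")
    status = "private" if home_is_private(user) else "world-readable"
    print(f"{user}'s home directory is {status}")
```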
The Ubuntu installer, already a great example of a smooth on-boarding experience, has been updated to include Active Directory support, so system administrators can manage and configure systems. If you encrypt your hard drive, Ubuntu 21.04 also introduces a recovery key, just in case the worst happens. Digging down into the kernel, we see Linux kernel 5.11 by default.
Ubuntu 21.04 is also available for the Raspberry Pi 4, and it works with Raspberry Pi models 2 through 400. But as mentioned in the release notes, there is a quirk that affects Pi users: “On the desktop image, the default user does not belong to the ‘dialout’ group with the result that they do not have non-root access to the GPIO pins.” The fix for this issue is to add your user to the “dialout” group, something that Arduino users have done for many years. Adding your user to the dialout group enables communication with the GPIO pins and many other USB-to-serial devices.
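For reference, here’s a small, hypothetical Python helper (not taken from the release notes) that checks for and applies that fix; the underlying usermod command is the standard way to add a user to a supplementary group on Linux.

```python
# Check for dialout membership and, if missing, add the current user to the group.
import getpass
import grp
import subprocess

def in_dialout(user: str) -> bool:
    """Return True if the user is already a supplementary member of 'dialout'."""
    return user in grp.getgrnam("dialout").gr_mem

if __name__ == "__main__":
    user = getpass.getuser()
    if in_dialout(user):
        print(f"{user} already has dialout access")
    else:
        # Requires sudo; log out and back in for the change to take effect.
        subprocess.run(["sudo", "usermod", "-aG", "dialout", user], check=True)
        print(f"Added {user} to dialout -- log out and back in to access the GPIO pins")
```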
Microsoft is starting to allow Windows 10 testers to access Linux GUI apps. The first preview of support for GUI applications is available today for Windows Insiders, allowing developers to run GUI editors, tools, and applications to build and test Linux apps. It’s a significant extension for Microsoft’s Windows Subsystem for Linux (WSL), after the company added a full Linux kernel to Windows 10 last year.
While it has been possible to run Linux GUI apps within Windows previously using a third-party X server, official support from Microsoft means there’s also GPU hardware acceleration so apps and tools run smoothly. Audio and microphone support is also included out of the box, so Linux devs can easily test or run video players and communications apps.
This is all enabled without Windows users having to set up X11 forwarding or manually start an X server. Microsoft automatically starts a companion system distro when you attempt to run a Linux GUI app, and it contains a Wayland compositor, an X server, a PulseAudio server, and everything else needed to make this work inside Windows. Once you close your Linux GUI apps and end your WSL session, this special distro shuts down, too. All of these components combine to make it super easy to run Linux GUI apps alongside regular Windows apps.
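As a sketch of the workflow (assuming Insider build 21364 or later and a default WSL distro with gedit installed), a Linux GUI app can be launched from the Windows side through wsl.exe, and the companion distro spins up behind the scenes:

```python
# Launch a Linux GUI app from Windows via WSL; WSLg starts the companion
# system distro (Wayland/X/PulseAudio) automatically.
import subprocess

def launch_linux_gui_app(command: str) -> None:
    """Run a command in the default WSL distro; GUI output appears as a Windows window."""
    subprocess.Popen(["wsl.exe", "--", command])

if __name__ == "__main__":
    launch_linux_gui_app("gedit")  # assumes gedit is installed inside the distro
```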
Microsoft is also testing a new eco mode for the Windows Task Manager in this latest test build. It’s an experimental feature that lets you throttle process resources inside Task Manager. It’s really designed to rein in apps that suddenly start taking up lots of system resources, and it could be useful if you want to temporarily throttle back an app.
If you’re interested in testing Linux GUI apps on Windows 10 or this new Task Manager feature, you’ll need to install the latest Windows Insider build 21364 from the Dev Channel. Be warned: these are designed as developer builds and not for machines you rely on daily.
Cerebras, the company behind the Wafer Scale Engine (WSE), the world’s largest single processor, shared more details about its latest WSE-2 today at the Linley Spring Processor Conference. The new WSE-2 is a 7nm update to the original Cerebras chip and is designed to tackle AI workloads with 850,000 cores at its disposal. Cerebras claims that this chip, which comes in an incredibly small 26-inch tall unit, replaces clusters of hundreds or even thousands of GPUs spread across dozens of server racks that use hundreds of kilowatts of power.
The new WSE-2 wields 850,000 AI-optimized cores spread out over 46,225 mm2 of silicon (roughly 8.5 x 8.5 inches) packed with 2.6 trillion transistors. Cerebras also revealed today that the second-gen chip has 40 GB of on-chip SRAM, 20 petabytes per second of memory bandwidth, and 220 petabits per second of aggregate fabric bandwidth. The company also revealed that the chip consumes the same 15kW of power as its predecessor but provides twice the performance, which is the benefit of moving to the denser 7nm node from the 16nm used with the previous-gen chip.
These almost unbelievable specifications stem from the fact that the company uses an entire TSMC 7nm wafer to construct one large chip, sidestepping the typical reticle limitations of modern chip manufacturing to create a wafer-sized processor. The company builds redundant cores directly into the hardware so it can disable defective ones, absorbing the impact of defects that inevitably crop up during manufacturing.
The company accomplishes this feat by stitching together the dies on the wafer with a communication fabric, allowing them to work as one large cohesive unit. This fabric provides 220 petabits per second of throughput for the WSE-2, slightly more than twice the 100 petabits per second of the first-gen model. The wafer also includes 40 GB of on-chip memory that provides up to 20 petabytes per second of throughput; both figures are more than double those of the previous-gen WSE.
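To put the reticle-limit point in perspective, here’s a rough back-of-the-envelope calculation (our numbers, not Cerebras’), assuming the commonly cited ~26 mm x 33 mm reticle field as the ceiling for a conventional monolithic die:

```python
# Rough comparison of WSE-2 silicon area to a conventional reticle-limited die.
# The reticle limit used here is an assumption, not a Cerebras figure.
WSE2_AREA_MM2 = 46_225        # Cerebras' published silicon area
RETICLE_LIMIT_MM2 = 26 * 33   # ~858 mm^2, a commonly cited lithography reticle field

equivalent_dies = WSE2_AREA_MM2 / RETICLE_LIMIT_MM2
print(f"The WSE-2 spans roughly {equivalent_dies:.0f} reticle-limited dies' worth of silicon")
# -> roughly 54, which is why the on-wafer dies are stitched together with a fabric
#    rather than cut apart and packaged individually.
```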
Cerebras hasn’t specified the WSE-2’s clock speeds, but has told us in the past that the first-gen WSE doesn’t run at a very “aggressive” clock (which the company defined as a range from 2.5GHz to 3GHz). We’re now told that the WSE-2 runs at the same clock speeds as the first-gen model, but provides twice the performance within the same power envelope due to its increased system resources. We certainly don’t see those types of generational performance improvements with CPUs, GPUs, or most accelerators. Cerebras says that it has made unspecified changes to the microarchitecture to extract more performance, too.
Cores are distributed into tiles, with each tile having its own router, SRAM memory, FMAC datapath, and tensor control. All cores are connected via a low-latency 2D mesh fabric. The company claims these optimizations result in a 2x improvement in wall-clock training time on a BERT-style network, using the same code and compiler used with the first-gen wafer-scale chip.
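To make that tile-and-mesh description concrete, here’s a purely illustrative Python model (not Cerebras’ software; the per-tile SRAM figure is a made-up placeholder) of tiles wired to their north/south/east/west neighbors:

```python
# Toy model of a 2D mesh of tiles, each with local SRAM and a list of mesh links.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Tile:
    x: int
    y: int
    sram_kib: int = 48  # placeholder value, for illustration only
    neighbors: List[Tuple[int, int]] = field(default_factory=list)  # 2D-mesh links

def build_mesh(width: int, height: int) -> Dict[Tuple[int, int], Tile]:
    """Wire a width x height grid of tiles into a 2D mesh (N/S/E/W neighbors)."""
    tiles = {(x, y): Tile(x, y) for x in range(width) for y in range(height)}
    for (x, y), tile in tiles.items():
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (x + dx, y + dy) in tiles:
                tile.neighbors.append((x + dx, y + dy))
    return tiles

mesh = build_mesh(4, 4)             # tiny toy grid; the real wafer has ~850,000 cores
print(len(mesh[(0, 0)].neighbors))  # a corner tile has only 2 mesh neighbors
```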
As before, the chip comes wrapped in a specialized 15U system that’s designed specifically to accommodate the unique characteristics of the wafer-scale device. We’re told that the changes to the first-gen CS-1 system, which you can read about in-depth here, are very minimal in the new CS-2 variant. Given that the most important metrics, like power consumption and the size of the WSE, have remained the same, it makes sense that most of the system is identical.
Cerebras hasn’t specified pricing, but we expect the WSE-2 will continue to attract attention from the military and intelligence communities for a multitude of purposes, including nuclear modeling, though Cerebras can’t divulge several of its customers (for obvious reasons). It’s safe to assume they are the types with nearly unlimited budgets, so pricing isn’t a concern. On the public-facing side, Argonne National Laboratory is using the first systems for cancer research and basic science, like studying black holes.
Cerebras also notes that its compiler easily scaled to exploit twice the computational power, so the software ecosystem that is already in place is supported. As such, the WSE-2 unit can accept standard PyTorch and TensorFlow code that is easily modified with the company’s software tools and APIs. The company also allows customers instruction-level access to the silicon, which stands in contrast to GPU vendors.
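To give a sense of what “standard PyTorch code” means here, below is a generic PyTorch training snippet of the sort Cerebras says its tools can ingest after light modification. This is not Cerebras-specific code; the model, layer sizes, and names are arbitrary.

```python
# Generic PyTorch example: a tiny classifier and one toy training step.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A small feed-forward network; sizes are arbitrary."""
    def __init__(self, in_features: int = 784, classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 256),
            nn.ReLU(),
            nn.Linear(256, classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TinyClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on random data, just to show the workflow.
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"toy loss: {loss.item():.4f}")
```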
Cerebras has working systems already in service now, and general availability of the WSE-2 is slated for the third quarter of 2021.