UNDERSTANDING LARGE DATA SETS

January 28, 2025

Big data, the larger, more complex data sets drawn from multiple, often non-traditional, sources, is now prevalent in eDiscovery. It has overwhelmed traditional data processing software and driven the development of advanced computer-based solutions such as TAR and AI.

Let’s look at exactly what big data is and how we can handle it in the eDiscovery process.

I. DATA

  A. WHAT IS LARGE?

What exactly is “big data”?  One company that should know is Oracle, which says:

"The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three 'Vs.'"

(Source: https://www.oracle.com/big-data/what-is-big-data)

Domo's online resource center offers this quick and easy rule of thumb:

“The most basic way to tell if data is big data is through how many unique entries the data has. Usually, a big dataset will have at least a million rows. A dataset might have less rows than this and still be considered big, but most have far more.”

(Source: www.domo.com/learn/article/4-ways-to-tell-if-your-data-is-big-data )

The problem is not just size. Many programs have limits on how much data they can display or analyze, and large datasets are slow to load precisely because of their bulk. Yet firms need to analyze the whole dataset, not just portions of it, so they need a tool that can inspect everything at once without taking so long that it becomes impractical to use.
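As a back-of-the-envelope illustration of that "million rows" rule of thumb, here is a minimal Python sketch (the file path and threshold are hypothetical) that streams through a CSV export and counts rows without ever loading the whole file into memory:

```python
import csv

# Domo's rough "million rows" rule of thumb (threshold is illustrative).
BIG_DATA_ROW_THRESHOLD = 1_000_000

def is_big_data(csv_path: str) -> bool:
    """Stream through a CSV and report whether it crosses the row threshold,
    without holding the whole file in memory at once."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)          # skip the header row, if present
        row_count = sum(1 for _ in reader)
    return row_count >= BIG_DATA_ROW_THRESHOLD

if __name__ == "__main__":
    # "collection_export.csv" is a placeholder path for illustration only.
    print(is_big_data("collection_export.csv"))
```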

  B. DATA EXPLOSION

The origins of large data sets go back to the 1960s and ‘70s with the establishment of the first data centers and the development of the relational database. By 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. Hadoop, an open-source framework created specifically to store and analyze big data sets, was developed that same year.

The development of open-source frameworks was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance. The emergence of machine learning has produced still more data, and the COVID-19 pandemic has driven a rapid increase in global data creation since 2020, as most of the world's population had to work from home and used the internet for both work and entertainment.

Users, both human and machine, are generating huge amounts of data. So just how much data is there in the world today? Some estimates suggest that the total amount of data on the internet reached 175 zettabytes in 2022.[1] Some studies show that 90 percent of the world's data was created in the last two years and that the volume of data across the world doubles in size every two years.

Additionally, there is a substantial amount of replicated data. By the end of 2024, the ratio of unique to replicated data was projected to shift from 1:9 to 1:10.

(Source: Statista.com)

The world's data volume has increased dramatically in the past twenty years for several reasons. First, in the pattern described by Moore's Law, digital storage has become larger, cheaper, and faster with each successive year. Second, with the advent of cloud databases, previous hard limits on storage size became obsolete as internet giants such as Google and Facebook used cloud infrastructure to collect massive amounts of data, and companies around the world soon adopted similar big data tactics. Finally, billions of new users gained internet access across the globe, pushing still more data accumulation.

What does that look like in real terms? The graphic below is instructive:

  C. NEW DATA TYPES

Newer types of data have also emerged, beyond the traditional word processing and spreadsheet files that were common in litigation matters for years.

Some common types of emerging data include:

  • Mobile data.
  • Messaging data.
  • Marketing data.
  • Medical data.
  • IoT data.

  D. MORE USERS

Out of the nearly 8 billion people in the world, 5.35 billion, or around 66% of the world's population, have access to the internet. By Q3 of 2023, it was estimated that almost 96 percent of the global digital population used a mobile device to connect to the internet. Global users spend almost 60 percent of their online time browsing the web from their mobile phones. The most popular mobile app activities were chatting and communicating, along with listening to music and online banking.

(Source: Statista)

II. LITIGATION

  A. HISTORY

How does all this large data fit into the historical timeline of litigation? The first "large" document case I was directly involved with was a coordinated action in Sacramento, CA, in 1986. I maintained an index (no images) of 5 million pages loaded into the DOS version of Summation on a Compaq 386 with a 20 MHz Intel 80386 CPU, 1 MB of RAM, 16 KB of ROM, two 1.2 MB 5¼-inch floppy drives, and a 40 MB hard disk drive. The PC cost $7,999.

By the mid-'90s, I was working for the Texas Attorney General on their tobacco litigation. I ran a coding shop of three shifts of 35-50 coders per shift using a LAN with Gravity software. We eventually housed a database of 13 million pages, again with no images.

The image below is of the Minnesota tobacco archives, where the AG collected 28,455 boxes stacked in rows up to 12 high, four wide, and 70 deep, containing 93 million pages of paper. It was the largest single records collection in the history of tobacco litigation.

In 2011, I began working with an iCONECT database here in New Orleans for the plaintiffs in the BP case. We had 1 billion pages of emails, word processing documents, spreadsheets, proprietary data applications, and instrumentation reports. The documents were hosted in a private cloud accessed by more than 100 outside law firms and their experts representing more than 116,000 individual plaintiffs. At any given point in time, 300 reviewers from over 90 law firms representing various case teams, as well as several state attorneys, were accessing the database.

Today there is a Relativity database in the January 6 insurrection cases with files for both the US Attorney's data and over half of the 1,200+ defendants, all of which exceeds 10 TB. It contains predominantly emails, texts, and video and audio files taken from seized cell phones, which were then produced to the defense.

  B. STRATEGIES

So, what strategies can we use to handle all this data?  And do any of the ones we used in 1986 still apply?

  1. The Z Factor

The most important strategy is not technological at all. It is what Bruce Markowitz, the SVP at Evolver Legal Services, likes to call "the Z factor." We all know about the X factor, the great unknown, but the Z factor is one that often goes unarticulated. (https://www.youtube.com/watch?v=k1AEFTdVzy0)

In a nutshell, it's the question I always ask my clients when we start a project: "What is it you want to do?" Bruce puts it as, "What is the end result you need?" Take the answer to that question and build your workflow around it.

  2. Map Your Data

I've said it over and over: get with the IT staff and generate a data map. In the old days it was easy; you simply asked where the warehouse with the boxes was. Now you need knowledgeable IT staff to show you the way. You're Lewis and Clark; they're Sacajawea. You're not going to get where you need to go without them.

Once you have a data map, you can decide the following (a simple way to record these answers is sketched after the list):

  • What data might potentially be relevant;
  • Where that data is located;
  • Who is in charge of managing that data; and
  • How to make sure it is preserved.
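One way to record those four answers is a simple structured inventory. Below is a minimal sketch, with hypothetical custodians and systems used purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class DataMapEntry:
    """One row in a litigation data map: what the data is, where it lives,
    who manages it, and how it is being preserved."""
    description: str    # what data might be relevant
    location: str       # system or repository where it resides
    custodian: str      # who is in charge of managing it
    preservation: str   # how it is being preserved (hold status, snapshot, etc.)

# Hypothetical entries for illustration only.
data_map = [
    DataMapEntry("Executive email", "Exchange Online", "IT messaging team", "Litigation hold applied"),
    DataMapEntry("Sales spreadsheets", "Departmental shared drive", "Sales operations manager", "Snapshot archived"),
    DataMapEntry("Mobile chat messages", "MDM-managed devices", "IT mobility team", "Collection scheduled"),
]

for entry in data_map:
    print(f"{entry.description}: {entry.location} ({entry.custodian}) -> {entry.preservation}")
```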
  3. Litigation Holds

This is not a one-and-done task. You need a comprehensive hold managed by someone who crafts it carefully and follows up on it periodically.

  4. Analytics

Analytics is not something we had in the paper days. The sooner you can use data analytics to quantify the key issues of potential litigation, the better. A good analytics assessment, even if based only on a significant sample of the data, can be crucial in developing a case strategy.
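For example, a reproducible random sample of a collection can be drawn with nothing more than the Python standard library; the document IDs and sample size below are hypothetical:

```python
import random

def draw_sample(doc_ids: list[str], sample_size: int = 500, seed: int = 42) -> list[str]:
    """Draw a reproducible random sample of document IDs for an early
    analytics assessment, before committing to a full review."""
    rng = random.Random(seed)
    if sample_size >= len(doc_ids):
        return list(doc_ids)
    return rng.sample(doc_ids, sample_size)

# Hypothetical population of collected document IDs.
population = [f"DOC-{i:06d}" for i in range(1, 100_001)]
sample = draw_sample(population)
print(f"Sampled {len(sample)} of {len(population):,} documents for early assessment")
```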

  5. Standardization

Again, this was not a factor in the early days of large data cases, when we struggled to work with paper documents and early formats of electronic data, each of which required its own OS or specialty database.

Now a key feature of ESI processing is to put all the data into a common format for review. Agreeing on this format can be problematic, however, so remember that Rule 26(f) of the Federal Rules of Civil Procedure requires the parties to confer and develop a proposed discovery plan, which should include this component.

That conference is also where the parties determine their approach to eDiscovery. To make a strong case for favorable proportionality, parties must understand their data, build strong collaboration across business units, and utilize eDiscovery software to enhance collaboration and streamline the process.
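As a rough sketch of what putting the data into a common format can mean in practice, the snippet below maps two hypothetical source schemas (email and chat) onto one agreed set of review fields and writes a simple CSV load file. The field names and the loadfile.csv output are assumptions for illustration, not any particular platform's specification:

```python
import csv
from datetime import datetime, timezone

# Hypothetical records from two source systems, each with its own field names.
email_records = [
    {"From": "alice@example.com", "SentDate": "2024-03-01T10:15:00+00:00", "Subject": "Q1 numbers"},
]
chat_records = [
    {"sender": "bob@example.com", "timestamp": "1709287200", "text": "See the attached draft"},
]

def normalize(record: dict, kind: str) -> dict:
    """Map source-specific fields onto one agreed set of review fields."""
    if kind == "email":
        when = datetime.fromisoformat(record["SentDate"])
        return {"Custodian": record["From"], "Date": when.isoformat(),
                "Content": record["Subject"], "SourceType": "email"}
    when = datetime.fromtimestamp(int(record["timestamp"]), tz=timezone.utc)
    return {"Custodian": record["sender"], "Date": when.isoformat(),
            "Content": record["text"], "SourceType": "chat"}

rows = [normalize(r, "email") for r in email_records] + [normalize(r, "chat") for r in chat_records]

# Write everything out in one common, reviewable format.
with open("loadfile.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Custodian", "Date", "Content", "SourceType"])
    writer.writeheader()
    writer.writerows(rows)
```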

  6. Data Exchange Protocol

Although a specific data exchange protocol is not required by the Federal Rules, it is encouraged by many observers, including the Sedona Conference and the EDRM. Having a specific agreement on data exchange makes standardization easier and more efficient.

III. CONCLUSION

My current favorite tool, Nextpoint, takes a proactive approach to large data sets by using two of the tools I mention above, Mapping and Analytics, in a process they call Early Data Assessment. EDA allows legal teams to sift through a mountain of electronic data to find potential evidence, thus reducing the data size and providing valuable insights which allow informed decisions for a more productive document review.


[1] A zettabyte is equal to 1,000 exabytes, or 1 trillion gigabytes.

1000 Megabytes = 1 Gigabyte.

1000 Gigabytes = 1 Terabyte.
1000 Terabytes = 1 Petabyte.
1000 Petabytes = 1 Exabyte.
1000 Exabytes = 1 Zettabyte.
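Because each unit is a factor of 1,000 larger than the one before it, the conversions are easy to compute; here is a quick sketch (the 175 ZB figure is the estimate cited above):

```python
# Decimal storage units, each 1,000 times larger than the one before it.
UNITS = ["MB", "GB", "TB", "PB", "EB", "ZB"]

def to_gigabytes(value: float, unit: str) -> float:
    """Convert a value expressed in any of the units above into gigabytes."""
    steps = UNITS.index(unit) - UNITS.index("GB")
    return value * (1000 ** steps)

# The 175 ZB estimate cited above, expressed in gigabytes:
print(f"{to_gigabytes(175, 'ZB'):,.0f} GB")   # 175,000,000,000,000 GB
```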


EDRM Co-Founder George Socha Talks About Upcoming Conference

March 24, 2013

Ediscovery RedZone regulars Browning Marean and Tom O'Connor speak with EDRM co-founder George Socha about the upcoming conference "EDiscovery Skills for Medium and Small Cases" at the University of Florida Levin College of Law. George describes how the groundbreaking 1½-day conference, co-sponsored by EDRM and the International Center for Automated Information Research (ICAIR) at the Levin College of Law, will be both live and streamed online, will be offered at an extremely low price, and will feature practical "how to" demos of ED software.

Browning and Tom will join Atty. Bruce Olson live via video feed from the ABA TechShow to do the opening day presentation, while DWR eDiscovery expert Barry O'Melia will speak on site at the conference about search technology. Digital WarRoom is also contributing free licenses to its award-winning software for all on-site attendees.

Get all the information on how to register by listening to George Socha in the EDiscovery Red Zone at http://www.digitalwarroom.com/redzone/