Jim Scott

Jim Scott is global head of developer relations, Python and Accelerated Computing, at NVIDIA.

Over his career, Scott has held positions running operations, engineering, architecture, and QA teams in the big data, regulatory, digital advertising, retail analytics, IoT, financial services, manufacturing, healthcare, chemicals, and geographic information systems industries. He has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data technologies such as Hadoop.

Articles by Jim Scott

Within the realm of data science, deep learning frameworks are predominantly delivered via software found in the Python ecosystem. When looking at the options in the space, it may appear to some as a battle for supremacy, but the reality is that people have their favorites for a variety of reasons. Calling this a "war" is perhaps a bit overdramatic.

Posted September 13, 2022

Python has become the default language for solving complex data problems due to its ease of use, plethora of domain-specific software libraries, and stellar community and ecosystem. These strengths have led to the emergence of even newer, easier-to-use frameworks that enable users to scale their Python code.

Posted May 16, 2022

In the world of machine learning (ML), there are a few processes that are critical to anyone in the space. The first is making sure the data used in machine learning is clean. This topic gets talked about a lot so, while it is very important, let's skip it and move on to the next step. Assuming a clean and complete dataset already exists, the user then proceeds to train models.

Posted April 01, 2022

With the combination of the PyData community and commercial software offerings (especially those accelerating Python's foundational components), as well as all the startups building solutions, there is a lot of momentum propelling Python forward.

Posted December 15, 2021

The amount of data associated with networks is massive, especially compared to the small percentage of traffic that is actually malicious. There is too much data to analyze in a single day, and the problem compounds daily. Threat tactics are constantly changing and events are occurring more frequently, forcing the network security industry to prepare for and react to any questionable situation. To make matters worse, cybersecurity specialists are in very high demand, and there is a limited pool of talent from which to draw.

Posted September 27, 2021

Network security logs are a ubiquitous record of system runtime states, activities, and events. Parsing logs with regular expressions is the most widely used method for network log analysis. Providing a toolset powered by NLP to perform log parsing is a game changer in the critical and time-sensitive area of cybersecurity.
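As an illustration of that regex-based approach, here is a minimal sketch that pulls named fields out of a syslog-style line. The log format, pattern, and field names are hypothetical and not tied to any particular product:

```python
import re

# Hypothetical syslog-style format; the pattern and field names are illustrative.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\w{3} +\d+ [\d:]+) "            # e.g. "Oct  4 12:01:55"
    r"(?P<host>\S+) "                               # reporting host
    r"(?P<process>[\w.-]+)(?:\[(?P<pid>\d+)\])?: "  # process name and optional PID
    r"(?P<message>.*)"                              # free-form message
)

def parse_line(line):
    """Return the named fields of a log line as a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = "Oct  4 12:01:55 fw01 sshd[4321]: Failed password for root from 10.0.0.5"
parsed = parse_line(sample)
print(parsed["process"], parsed["pid"])  # sshd 4321
```

The pain point is exactly this: every log format needs its own hand-maintained pattern, which is what an NLP-powered parser aims to avoid.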

Posted May 26, 2021

Securing information systems and data is a foundation for any organization. Detection of insider threats can be a considerable challenge for threat detection systems and security analysts. This is due to the difficulty of determining non-normal actions from internal system behavior data.
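One reason this is hard: a naive statistical baseline only catches extreme deviations. The sketch below, using a simple z-score over hypothetical per-user daily access counts (the numbers and threshold are invented for illustration), flags obvious outliers but would miss subtler non-normal behavior:

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical daily file-access counts for one user; the last day is anomalous.
counts = [12, 9, 15, 11, 14, 10, 13, 980]
print(flag_outliers(counts))  # [980]
```

An insider slowly exfiltrating a few extra files per day would never trip such a threshold, which is why behavioral modeling in this space goes well beyond simple statistics.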

Posted April 05, 2021

The tectonic technology shifts we saw with the advent of Hadoop will not be seen again for quite a long time, but don't get too comfortable because the development of new tools and technologies to support the data science space isn't slowing down any time soon.

Posted January 18, 2021

When talking about data science, most people feel as if they are in one of two camps as far as data size. The first is really small data—hundreds of megabytes to a few gigabytes. The second is gigabytes to terabytes. Notice I didn't say "big data," nor did I say "petabytes." The source datasets may start at petabyte-scale, but keep in mind that data is often very raw, and most of it is ignored.

Posted September 11, 2020

The pandemic is revealing gaps in our critical infrastructure security, the fragility of our supply chains, and our underutilization of modern technologies that could help communities around the globe mitigate and recover. Although advanced technology platforms have been used by large international corporations, the pandemic is exposing the fact that emergency management and public health agencies are behind the curve in using data science, open source software, and high-performance computing resources.

Posted May 28, 2020

As data sizes have grown over the last decade, so has the amount of time it takes to run ETL processes to support the myriad downstream workloads. A decade ago, most people were only thinking about making their KPI dashboards faster. As time rolled forward, they started to think about getting more intelligent analytics out of their data, and the data sizes quickly grew from gigabytes to terabytes.

Posted May 18, 2020

Accelerating the Data Science Ecosystem

Posted December 16, 2019

It is well known that data scientists spend about 90% of their time on data logistics-related tasks. Anything a data scientist can do to reduce that time is a good use of effort, and a benefit to the organization as a whole. Enter RAPIDS: a data science framework offering support for executing an end-to-end data science pipeline entirely on the GPU.

Posted September 03, 2019

Apache Airflow is turning heads these days. It integrates with many different systems and it is quickly becoming as full-featured as anything that has been around for workflow management over the last 30 years. This is predominantly attributable to the hundreds of operators for tasks such as executing Bash scripts, executing Hadoop jobs, and querying data sources with SQL.

Posted May 16, 2019

Kubeflow is a workflow tool that prides itself on making machine learning workflows simple to build, scalable, and portable. It provides graphical end-user tools to set up and define the steps in a pipeline. Most importantly, as data scientists build out their use cases, they add more and more steps and, when using Kubeflow, they end up with a documented, repeatable process.

Posted April 24, 2019

The Importance of Data for Applications and AI

Posted January 09, 2019

Thanks to the dramatic uptick in GPU capabilities, gone are the days when data scientists created and ran models in a one-off manual process. This is good news because the one-off model was typically not optimized for the best results. The bad news is that with the increase in the total number of models created, including iterations over time, the amount of data used as inputs to and generated by the models quickly spirals out of control. Worse still, there are a variety of complexities associated with model, data, job, and workflow management.

Posted September 26, 2018

Blockchain has been one of the most loudly trumpeted new technologies on the enterprise database scene in recent history. However, the concept of a blockchain is not really a new notion. It is more of a repackaging of existing constructs to deliver a new set of benefits to any organization leveraging it for their use cases. It provides the benefit of irrevocable proof, and it reduces friction with information exchange.
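The "irrevocable proof" benefit comes from one of those existing constructs: a hash chain, in which each block embeds a hash of its predecessor, so altering any earlier record invalidates everything after it. A minimal sketch (the block fields and data are invented for illustration):

```python
import hashlib
import json

def block_hash(contents):
    """Deterministically hash a block's contents, which include the previous hash."""
    payload = json.dumps(contents, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    """Link a new block to the chain by embedding the last block's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"data": data, "prev_hash": prev}
    block["hash"] = block_hash({"data": data, "prev_hash": prev})
    chain.append(block)
    return chain

def verify(chain):
    """Tampering with any earlier block breaks every later link."""
    prev = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        if block_hash({"data": block["data"], "prev_hash": prev}) != block["hash"]:
            return False
        prev = block["hash"]
    return True

chain = []
append_block(chain, "shipment received")
append_block(chain, "payment cleared")
print(verify(chain))                    # True
chain[0]["data"] = "payment reversed"   # tamper with history
print(verify(chain))                    # False
```

Real blockchains layer consensus and distribution on top, but the tamper evidence itself is just this chained hashing.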

Posted May 11, 2018

Data, data, data. With the exponential data growth that has occurred in the last 15 years, a need has arisen to sift through that data to find answers to questions.

Posted March 26, 2018

In a world where new technologies are often presented to the industry as rainbows and unicorns, there is always someone in a cubicle trying to figure out how to solve business problems and just make these great new technologies work together. The truth is that all of these technologies take time to learn, and it also takes time to identify the problems that can be solved by each of them.

Posted January 05, 2018

Many people are unsure of the differences between deep learning, machine learning, and artificial intelligence. Generally speaking, and with minimal debate, it is reasonably well-accepted that artificial intelligence can most easily be categorized as that which we have not yet figured out how to solve, while machine learning is a practical application with the know-how to solve problems, such as anomaly detection.

Posted September 20, 2017

When people talk about the next generation of applications or infrastructure, what is often echoed throughout the industry is the cloud. On the application side, the concept of "serverless" is becoming less of a pipe dream and more of a reality. The infrastructure side has already proven that it is possible to deliver the ability to pay for compute on an hourly or more granular basis.

Posted May 15, 2017

It is difficult to find someone not talking about or considering using containers to deploy and manage their enterprise applications. A container looks like just another process running on a system; a dedicated CPU and pre-allocated memory aren't required to run one. The simplicity of building, deploying, and managing containers is among the reasons their popularity is growing rapidly.

Posted April 07, 2017

Hadoop Fundamentals and Key Technologies in the Evolving Hadoop Ecosystem

Posted February 03, 2017

Choosing when to leverage cloud infrastructure is a topic that should not be taken lightly. There are a few issues that should be considered when debating cloud as part of a business strategy.

Posted October 04, 2016

7 Key Technologies in the Evolving Hadoop Ecosystem

Posted June 03, 2016

NoSQL databases were born out of the need to scale transactional persistence stores more efficiently. In a world where the relational database management system (RDBMS) was king, this was easier said than done.

Posted March 29, 2016

Hadoop and Its Accompanying Ecosystem Are Here to Stay

Posted January 19, 2016

It's Always a Good Time for Real Time Data Access

Posted October 13, 2015

Google white papers have inspired many great open source projects. What has been missing until now, however, has been a way of bringing these technologies together such that any data-centric organization can benefit from the capabilities of each technology across its entire data center, and in new ways not documented by any single white paper. This is called the "Zeta Architecture."

Posted May 19, 2015

In order to truly appreciate Apache Drill, it is important to understand the history of the projects in this space, as well as the design principles and the goals of its implementation.

Posted April 08, 2015