Machine learning model for predicting startup success

TYPE Machine learning algorithm

TECHNOLOGIES NetworkX, pandas, PyTorch, PyTorch Geometric

AREAS OF EXPERTISE Finance, Analytics

TEAM 2 ML developers, 1 Project manager

We developed a machine learning model that predicts startup success, designed specifically for assessing startups seeking investment. It is an essential tool for our client, supporting informed funding decisions in the company’s day-to-day work. As input, we received a list of deals concluded between companies (or their CEOs) and investors (or investment funds) starting from 2006. We also had access to a list of companies that went public through IPOs over the last 20 years.

Our method involved building a dynamic graph in which companies and investors were nodes and the contracts between them were edges. On this graph, we trained vector representations of companies (known as embeddings) that take their history into account: the emergence of new edges (new deals) and the ‘neighbors’ of each company in the graph.
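
The idea of a dynamic graph can be illustrated with a minimal sketch in which every deal is a timestamped edge and neighbors are queried as of a given date. The class names, field names and sample deals below are hypothetical and use only the Python standard library; the project itself relied on NetworkX and PyTorch Geometric.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deal:
    """One timestamped edge between an investor and a company (hypothetical schema)."""
    investor: str
    company: str
    signed: date


class DynamicDealGraph:
    """Stores deals as timestamped edges and answers 'as of date' queries."""

    def __init__(self):
        self._deals = []

    def add_deal(self, deal):
        self._deals.append(deal)

    def neighbors(self, node, as_of):
        """Return nodes linked to `node` by deals signed on or before `as_of`."""
        found = set()
        for deal in self._deals:
            if deal.signed > as_of:
                continue  # this edge does not exist yet at the requested date
            if deal.investor == node:
                found.add(deal.company)
            elif deal.company == node:
                found.add(deal.investor)
        return found


# Made-up sample data, for illustration only.
graph = DynamicDealGraph()
graph.add_deal(Deal("Fund A", "Startup X", date(2015, 3, 1)))
graph.add_deal(Deal("Fund B", "Startup X", date(2018, 6, 15)))
graph.add_deal(Deal("Fund A", "Startup Y", date(2019, 1, 10)))

early = graph.neighbors("Startup X", as_of=date(2016, 1, 1))
late = graph.neighbors("Startup X", as_of=date(2020, 1, 1))
```

Querying the same company at two dates returns different neighbor sets, which is the kind of history the embeddings are meant to capture.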

The embedding of a graph vertex (i.e., a company) is a vector of a specified dimension, updated (trained) by performing auxiliary tasks like identifying the vertex type (company or non-company) and predicting new edges (deals). The underlying principle of these embeddings is that companies with similar behaviors and interactions with similar subsets of neighbors will have comparable vector representations, which are crucial for the machine learning model to accurately predict startup success.
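
As a rough illustration of this principle, the sketch below compares hand-written embeddings with cosine similarity and scores a potential deal with a sigmoid over the dot product. The vectors are invented for the example, and the sigmoid-of-dot-product scorer is a common link-prediction choice assumed here, not necessarily the one used in the project.

```python
import math


def dot(u, v):
    """Plain dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def link_score(u, v):
    """Assumed edge scorer: sigmoid of the dot product, giving a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-dot(u, v)))


# Illustrative 3-dimensional embeddings; real ones are learned during training.
startup_x = [0.9, 0.1, 0.3]
startup_y = [0.8, 0.2, 0.4]   # behaves much like startup_x
startup_z = [-0.5, 0.9, -0.2]  # behaves differently

similar = cosine_similarity(startup_x, startup_y)
different = cosine_similarity(startup_x, startup_z)
```

With these made-up values, `similar` is close to 1 while `different` is negative, matching the intuition that companies with similar behavior end up near each other in the embedding space.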

Challenges

Data Transformation
Implementation of the paper in terms of PyTorch Geometric
Discrepancy in Formulas
Debugging Challenges
Model Training and Results

Solutions

As our initial data sources, we had a set of tables with the following data:
Deal history: parties involved, type of deal (IPO, M&A, VC round), deal date;
Information about funds acting as a party to the deal;
Information about investors;
Information about companies: location, CEO name, CEO's level of education, the company's field of work.
The data covered the period from 1997 to 2022. To train the graph network, we needed to assemble this tabular data correctly, transform it into a graph structure, and enrich it with data from external sources. For the transformation, we used the chain CSV -> pandas DataFrame -> NetworkX graph -> PyG graph, which allowed us to check the accuracy of the data at each stage.
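
The staged transformation can be sketched as follows. To keep the example runnable without third-party packages, the standard library stands in for pandas and NetworkX, and the last stage only produces the 2 x E source/target layout that PyTorch Geometric uses for `edge_index`. The column names and sample rows are assumptions, not the client's schema.

```python
import csv
import io

# Hypothetical deal-history extract; column names are assumptions.
RAW_CSV = """investor,company,deal_type,deal_date
Fund A,Startup X,VC round,2015-03-01
Fund B,Startup X,VC round,2018-06-15
Fund A,Startup Y,VC round,2019-01-10
"""

# Stage 1: CSV -> list of row dicts (the role pandas played in the project).
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Stage 2: rows -> undirected adjacency mapping (the role NetworkX played).
adjacency = {}
for row in rows:
    adjacency.setdefault(row["investor"], set()).add(row["company"])
    adjacency.setdefault(row["company"], set()).add(row["investor"])

# Stage 3: adjacency -> integer node ids plus a 2 x E edge list,
# the layout PyTorch Geometric expects for `edge_index`.
node_ids = {name: idx for idx, name in enumerate(sorted(adjacency))}
sources, targets = [], []
for name in sorted(adjacency):
    for other in sorted(adjacency[name]):
        sources.append(node_ids[name])
        targets.append(node_ids[other])
edge_index = [sources, targets]

# Checks between stages: no node or deal may be lost along the way.
unique_pairs = {(r["investor"], r["company"]) for r in rows}
all_names = {r["investor"] for r in rows} | {r["company"] for r in rows}
assert set(node_ids) == all_names
assert len(sources) == 2 * len(unique_pairs)  # each edge stored in both directions
```

Keeping a check after every stage is what makes the chain useful for monitoring: a mismatch points directly at the step that lost or duplicated data. Note that this simplified adjacency collapses repeated deals between the same pair of parties.
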
During the project, we had to implement an approach described in a research paper. We chose the PyTorch Geometric framework as the basis for the implementation: it is tailored for writing graph neural networks and contains a sufficient number of primitives for working with them, including tools for dynamic graphs. The paper did not include code or data for reproducibility; only a high-level description was provided. From a technical standpoint, therefore, our task was to implement the proposed approach in terms of PyTorch Geometric.
Discrepancy in Formulas: Since the paper did not include any code, we had to write it ourselves from the formulas and the algorithm description, and we quickly noticed that the formulas in the paper were inconsistent. For example, the symbol “$\cdot$”, which usually denotes the dot product, could not be implemented as such in some places because the dimensions of the output tensors did not match. However, once we interpreted it as the Hadamard (element-wise) product, which is more commonly written as “$\odot$”, everything fell into place and worked. The paper did not explain what this operator meant. Each time we encountered a similar issue, we conducted additional research.
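
The dimension mismatch can be made explicit with the two standard definitions. The symbols $a$, $b$ and $d$ are generic placeholders chosen for this illustration, not notation taken from the paper:

$$
a \cdot b = \sum_{i=1}^{d} a_i b_i \in \mathbb{R},
\qquad
a \odot b = (a_1 b_1, \dots, a_d b_d) \in \mathbb{R}^{d},
\qquad a, b \in \mathbb{R}^{d}.
$$

The dot product collapses two $d$-dimensional vectors into a scalar, whereas the Hadamard product keeps the dimension $d$. A formula whose next step expects a $d$-dimensional vector therefore only type-checks under the element-wise reading.
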
Verifying the validity of the graph was also difficult because of the large amount of data. To make sure the data had been transformed correctly, we took companies that were known to the client and logged their dynamics in various aspects during training (list of deals, company parameters, graphical representation with closest neighbors, etc.). Overall, this approach proved effective: it allowed us to identify several serious bugs.
As of 2022, the resulting graph had 200,000 nodes (funds, companies, investors, CEOs) and 1 million edges. For training, we used a GCP instance with a T4 GPU and 32 GB of RAM. Because the model trained on the graph was too large to fit into GPU memory as a whole, only individual parts of it were moved to the GPU during training. Training for 100 epochs took about 20 hours in total. Our trained model achieved an AP@10 of 23%, which is 3% higher than stated in the paper.
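
For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision at k (AP@k). The text does not state which variant the paper uses, so the normalization by `min(k, number of relevant items)` is an assumption, and the function name and sample data are hypothetical.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """AP@k: sum of precision@i over the ranks i <= k holding a relevant item,
    divided by min(k, number of relevant items) (one common convention)."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    denominator = min(k, len(relevant))
    return precision_sum / denominator if denominator else 0.0


# Made-up example: startups ranked by predicted success versus the ones
# that actually succeeded.
ranked_startups = ["s1", "s2", "s3", "s4", "s5"]
successful = {"s1", "s3", "s9"}
score = average_precision_at_k(ranked_startups, successful, k=10)
```

In this example, relevant items sit at ranks 1 and 3, so the score is (1/1 + 2/3) / 3. A ranking that places all successful startups at the top would score 1.0.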

Results

Experiments were conducted on real datasets. The results obtained with the proposed model surpass the latest baseline metrics and are 1.94 times better than the performance of real investors. The best prediction results were achieved for startups in the fields of IT and healthcare.

Business Value

Acceleration and improvement of the quality of startup selection by investment funds;
The ability to invest in startups with greater profitability;
The ability to conduct initial startup screening in an automated manner;
Reduction in the time specialists spend on the analysis and selection of startups.
