Projects

Machine learning model for predicting startup success

TYPE Machine learning algorithm
TECHNOLOGIES NetworkX, pandas, PyTorch, PyTorch Geometric
AREAS OF EXPERTISE Finance, Analytics
TEAM 2 ML developers, 1 Project manager

The client required an ML model for the preliminary assessment of startups seeking investment, to serve as an auxiliary tool for making funding decisions. As input, we received a list of deals concluded between companies (or their CEOs) and investors (or investment funds) from 2006 onward, together with a list of companies that have gone public through IPOs in the last 20 years.

We built a dynamic graph in which companies and investors are nodes and the deals concluded between them are edges. A neural network then learns a vector representation of each company (commonly known as an embedding), taking into account the company's history: the emergence of new edges (new deals) and the company's ‘neighbors’ in the graph.
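
As a rough illustration of such a dynamic graph, the sketch below stores each deal as a timestamped edge and rebuilds every node's neighborhood as of a given date. It uses only the Python standard library. The `Deal` record, the sample names, and the `neighbors_as_of` helper are hypothetical and are not taken from the project code, which relied on NetworkX and PyTorch Geometric.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deal:
    """One edge of the dynamic graph (hypothetical record layout)."""
    company: str
    investor: str
    signed: date


def neighbors_as_of(deals, cutoff):
    """Return {node: set of neighbors} using only deals signed on or before cutoff."""
    adjacency = defaultdict(set)
    for deal in deals:
        if deal.signed <= cutoff:
            # Each deal is treated as an undirected edge between its two parties.
            adjacency[deal.company].add(deal.investor)
            adjacency[deal.investor].add(deal.company)
    return dict(adjacency)


# Made-up sample data, for illustration only.
deals = [
    Deal("AcmeAI", "Fund A", date(2010, 3, 1)),
    Deal("AcmeAI", "Fund B", date(2015, 6, 15)),
    Deal("MedCo", "Fund A", date(2012, 9, 30)),
]

snapshot_2013 = neighbors_as_of(deals, date(2013, 1, 1))
snapshot_2016 = neighbors_as_of(deals, date(2016, 1, 1))
```

Comparing the two snapshots shows how a company's neighborhood grows as new deals appear, which is the kind of history the embeddings are meant to capture.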

The embedding of a graph vertex (i.e., a company) is a vector of a fixed dimension that is updated (trained) on two auxiliary tasks: identifying the type of vertex (company or non-company) and predicting the emergence of new edges (deals). The core idea behind embeddings is that companies that behave similarly and interact with a similar subset of neighbors end up with similar vector representations.
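
To make the notion of "similar representations" concrete, the sketch below compares hand-written toy vectors with cosine similarity. Cosine similarity is an assumption chosen for illustration; the source does not state which similarity measure, if any, was used. The vectors and company names are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional embeddings (made up, not model output).
embeddings = {
    "company_a": [0.9, 0.1, 0.3],
    "company_b": [0.8, 0.2, 0.4],    # behaves much like company_a
    "company_c": [-0.5, 0.9, -0.2],  # behaves differently
}

sim_ab = cosine_similarity(embeddings["company_a"], embeddings["company_b"])
sim_ac = cosine_similarity(embeddings["company_a"], embeddings["company_c"])
```

Here `sim_ab` is close to 1 while `sim_ac` is negative, so `company_b` would be treated as the closer match to `company_a`.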

Challenges
  1. Data Transformation
  2. Implementation of the paper in terms of PyTorch Geometric
  3. Discrepancy in Formulas
  4. Debugging Challenges
  5. Model Training and Results
Solutions
  1. As our initial data sources, we had a set of tables with the following data:
    - Deal history: parties involved, type of deal (IPO, M&A, VC round), deal date;
    - Information about funds acting as a party to the deal;
    - Information about investors;
    - Information about companies: location, CEO name, CEO's level of education, and the company's field of work.

    The data covered the period from 1997 to 2022.

    To train the graph network, we had to assemble this tabular data correctly, transform it into a graph structure, and enrich it with data from external sources. We used the chain CSV -> pandas DataFrame -> NetworkX graph -> PyG graph, which let us check the data for correctness at each stage of the transformation.

  2. During the project, we had to implement the approach described in the paper. As the basis, we chose PyTorch Geometric, a framework tailored to graph neural networks that provides enough primitives for working with them, including tools for dynamic graphs. The paper came without code or data for reproducibility; only a high-level description was provided. From a technical standpoint, our task was therefore to express the proposed approach in terms of PyTorch Geometric.
  3. Since the paper did not include any code and we had to write it ourselves from the formulas and the algorithm description, we quickly noticed that the formulas in the paper were inconsistent. For example, the symbol “$\cdot$”, which usually denotes the dot product, could not be implemented as such in some formulas because the dimensions of the resulting tensors did not match. However, once we interpreted it as the Hadamard (element-wise) product, which is usually written with its own symbol such as “$\odot$”, everything fell into place and worked. The paper did not explain which operation this operator stood for, so each time we ran into a similar issue, we conducted additional research.
  4. Verifying the validity of the graph was also difficult because of the large amount of data. To make sure the data was transformed correctly, we took companies known to the client and logged how they evolved during training from several angles (list of deals, company parameters, graphical representation with closest neighbors, etc.). This approach proved effective: it allowed us to identify several serious bugs.
  5. As of 2022, the resulting graph had 200,000 nodes (funds, companies, investors, CEOs) and 1 million edges. For training, we used a GCP instance with a T4 GPU and 32 GB RAM. Because the model was too large to fit into GPU memory as a whole, only individual parts of it were moved to the GPU during training. Training for 100 epochs took about 20 hours in total. Our trained model achieved an AP@10 of 23%, which is 3% higher than stated in the paper.
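
The operator ambiguity described in item 3 comes down to output shapes: a dot product collapses two vectors into a single number, while a Hadamard product keeps the vector shape. The sketch below shows the difference in plain Python. The `gate` and `state` names are hypothetical and do not reproduce the paper's formulas. In PyTorch, the corresponding operations would be `torch.dot(u, v)` and `u * v`.

```python
def dot(u, v):
    """Dot product: two equal-length vectors -> one scalar."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def hadamard(u, v):
    """Hadamard (element-wise) product: two equal-length vectors -> one vector."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(u, v)]


# Hypothetical example: a gate vector applied to a state vector.
gate = [0.0, 0.5, 1.0]
state = [2.0, 4.0, 6.0]

as_dot = dot(gate, state)            # a single number: the vector shape is lost
as_hadamard = hadamard(gate, state)  # a vector of the same length as `state`
```

If the next step of a formula expects a vector of the same size as `state`, only the element-wise reading produces a compatible result, which matches the kind of dimension mismatch described above.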
Results

Experiments were conducted on real datasets. The results obtained with the proposed model surpass the most recent baselines and are 1.94 times better than the performance of real investors. The best prediction results were achieved for startups in the fields of IT and healthcare.
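
For readers unfamiliar with the AP@10 metric quoted for the trained model, the sketch below shows one common way to compute average precision at k for a single ranked list. This is an assumption for illustration: several variants of the metric exist, and the source does not say which one the project or the paper used. The function name and the sample data are hypothetical.

```python
def average_precision_at_k(ranked_items, relevant, k=10):
    """Average precision over the top-k items of a ranked list.

    Normalized by min(k, number of relevant items), which is one common
    convention; other definitions divide by the number of hits instead.
    """
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denominator = min(k, len(relevant))
    return precision_sum / denominator if denominator else 0.0


# Made-up example: startups ranked by predicted score, and those that succeeded.
ranked = ["s1", "s2", "s3", "s4", "s5"]
succeeded = {"s1", "s3", "s9"}

score = average_precision_at_k(ranked, succeeded, k=5)
```

In this toy case the hits fall at ranks 1 and 3, so the score is (1/1 + 2/3) / 3, or roughly 0.56.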

Business Value

  1. Acceleration and improvement of the quality of startup selection by investment funds;
  2. The ability to invest in startups with greater profitability;
  3. The ability to conduct initial startup screening in an automated manner;
  4. Reduction in the time specialists spend on the analysis and selection of startups.