Class documentation

Finds copy-pasted lines of code in a project.

Method __init__ Initialize the checker with the minimal similarity length and the ignore_* filtering flags.
Method append_stream Append a file to search for similarities.
Method combine_mapreduce_data Reduces and recombines data into a format that we can report on.
Method get_map_data Returns the data we can use for a map/reduce process.
Method run Start looking for similarities and display results on stdout.
Instance Variable linesets The LineSet objects built from the appended streams.
Instance Variable namespace Undocumented
Method _compute_sims Compute similarities in appended files.
Method _display_sims Display computed similarities on stdout.
Method _find_common Find similarities in the two given linesets.
Method _get_similarity_report Create a report from similarities.
Method _iter_sims Iterate over similarities among all files by taking a Cartesian product of the linesets.
def __init__(self, min_lines: int = DEFAULT_MIN_SIMILARITY_LINE, ignore_comments: bool = False, ignore_docstrings: bool = False, ignore_imports: bool = False, ignore_signatures: bool = False): (source)

Initialize the checker with the minimal similarity length and the ignore_* filtering flags. min_lines sets how many successive identical lines are needed before a chunk is reported; ignore_comments, ignore_docstrings, ignore_imports and ignore_signatures each strip the corresponding construct from the comparison.
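A minimal construction sketch; the import path is an assumption based on pylint's source layout (pylint/checkers/similar.py):

    # Construction sketch; the import path is assumed from pylint's
    # source tree.
    from pylint.checkers.similar import Similar

    sim = Similar(
        min_lines=4,             # report only chunks of 4+ similar lines
        ignore_comments=True,    # strip comments before comparing
        ignore_docstrings=True,  # strip docstrings as well
        ignore_imports=False,
        ignore_signatures=False,
    )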

def append_stream(self, streamid: str, stream: STREAM_TYPES, encoding: str|None = None): (source)

Append a file to search for similarities.
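Continuing the construction sketch above, with hypothetical file names; the streams are opened in binary mode so the explicit encoding applies:

    # Each stream gets a streamid (here the path) used to label results.
    for path in ("module_a.py", "module_b.py"):  # hypothetical files
        with open(path, "rb") as stream:
            sim.append_stream(path, stream, encoding="utf-8")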

def combine_mapreduce_data(self, linesets_collection: list[list[LineSet]]): (source)

Reduces and recombines data into a format that we can report on. This is the partner function of get_map_data(); see the sketch after get_map_data() below.

def get_map_data(self) -> list[LineSet]: (source)

Returns the data we can use for a map/reduce process: this instance's linesets, that is, all the file information that will later be used for vectorisation.
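A sketch of how get_map_data() and combine_mapreduce_data() pair up in a parallel run; the map_worker helper and the file names are hypothetical, and the two workers are simulated sequentially:

    from pylint.checkers.similar import LineSet, Similar

    def map_worker(paths: list[str]) -> list[LineSet]:
        # Hypothetical map step: one Similar instance per worker.
        worker = Similar(min_lines=4)
        for path in paths:
            with open(path, "rb") as stream:
                worker.append_stream(path, stream, encoding="utf-8")
        return worker.get_map_data()

    # Simulate two workers, then reduce their linesets in one instance.
    collected = [map_worker(["pkg/a.py"]), map_worker(["pkg/b.py", "pkg/c.py"])]
    reducer = Similar(min_lines=4)
    reducer.combine_mapreduce_data(linesets_collection=collected)
    reducer.run()  # report similarities across every worker's files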

def run(self): (source)

Start looking for similarities and display results on stdout.
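Continuing the construction sketch above, running the analysis is a single call; the report goes to stdout:

    # With the streams appended, compute the similarities and print
    # each duplicated chunk together with the files it appears in.
    sim.run()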

linesets = (source)

The LineSet objects built from the appended streams; this is the file information later used for vectorisation (see get_map_data).

namespace = (source)

Undocumented

def _compute_sims(self) -> list[tuple[int, set[LinesChunkLimits_T]]]: (source)

Compute similarities in appended files.

def _display_sims(self, similarities: list[tuple[int, set[LinesChunkLimits_T]]]): (source)

Display computed similarities on stdout.

def _find_common(self, lineset1: LineSet, lineset2: LineSet) -> Generator[Commonality, None, None]: (source)

Find similarities in the two given linesets. This is the core of the algorithm. The idea is to compute the hashes of a minimal number of successive lines of each lineset and then compare the hashes. Every match of such a comparison is stored in a dict that links the pair of starting indices in both linesets to the pair of corresponding starting and ending lines in both files. Finally, all successive pairs are regrouped into larger chunks, so that common chunks longer than the minimal number of successive lines are also taken into account.
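An illustrative sketch of the hashing idea described above, not pylint's actual implementation: hash every window of min_lines successive lines in each file, then treat any hash shared by both files as a candidate duplicated chunk (all names here are hypothetical):

    from collections import defaultdict
    from typing import Iterator

    def window_hashes(lines: list[str], min_lines: int) -> dict[int, int]:
        # Hash each window of min_lines successive lines, keyed by its
        # starting index in the file.
        return {
            start: hash(tuple(lines[start : start + min_lines]))
            for start in range(len(lines) - min_lines + 1)
        }

    def common_windows(
        lines1: list[str], lines2: list[str], min_lines: int = 4
    ) -> Iterator[tuple[int, int]]:
        # Yield (start1, start2) pairs whose windows hash identically.
        by_digest: defaultdict[int, list[int]] = defaultdict(list)
        for start2, digest in window_hashes(lines2, min_lines).items():
            by_digest[digest].append(start2)
        for start1, digest in window_hashes(lines1, min_lines).items():
            for start2 in by_digest.get(digest, []):
                yield start1, start2

In the regrouping step, successive matches such as (i, j) and (i + 1, j + 1) would then be merged into a single chunk, which is how common chunks longer than min_lines end up reported once.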

def _get_similarity_report(self, similarities: list[tuple[int, set[LinesChunkLimits_T]]]) -> str: (source)

Create a report from similarities.

def _iter_sims(self) -> Generator[Commonality, None, None]: (source)

Iterate over similarities among all files by taking a Cartesian product of the linesets.
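A sketch of the pairwise iteration: itertools.combinations yields each unordered pair of linesets exactly once, which matches the Cartesian-product idea while skipping self-pairs (common_windows is the hypothetical helper from the _find_common sketch above):

    from itertools import combinations

    def iter_sims(linesets):
        # Compare every distinct pair of linesets exactly once.
        for lineset1, lineset2 in combinations(linesets, 2):
            yield from common_windows(lineset1, lineset2)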