Downloading identical files from multiple swarms/torrents

izomiac · June 30, 2006

Currently I'm downloading an unlicensed anime and an interesting feature idea just occurred to me. You see, there are several torrents of the series, each with files that are the same as their counterparts in the other torrents (releases from a certain fansubbing group). Torrent A might have the first couple of episodes while Torrent B might have the entire season. They also have different naming schemes. My idea is for a feature that will allow a user to essentially download both torrents simultaneously to the same location. This should provide a much larger number of peers and thus a faster download speed. I'm also sure it would be useful in other situations (in case you don't like my example for ethical/legal reasons). In case I haven't explained this well enough for someone else to understand it, here's an example.

Torrent A has 3 files: 1.avi, 2.avi, 3.avi

Torrent B has 10 files: a.avi, b.avi, etc.

Torrent C has 4 files: 3.avi, 4.avi, 5.avi, 6.avi

Each torrent has a different group of seeders/peers.

My idea is for a feature that would allow you to treat these three torrents as one download. You tell BitComet (or it determines based on hashes/filesize) which files are identical (in this case 3.avi in A & C and c.avi in B) and it downloads them to a single location without redundantly requesting blocks from both swarms.

Const2k · June 30, 2006

We have almost the same idea being discussed nearby. So I posted my answer there and suggest everyone post there too.

izomiac · July 3, 2006

While I can appreciate the effort at consolidation, it's not really the same thing IMHO (similar though; I looked around but didn't realize the similarities). That thread requests that BitComet prevent you from accidentally downloading the same file twice. This one talks about a way to increase download speed in certain situations. That said, a lot of code could probably be shared between the two features if they both got added, because both features require identical file identification between different torrents.

In response to your discussion of block distribution in the other thread...

To sum up, it's impossible right now to compare files within different .torrents, as the .torrent structure is based on hashing pieces (not files).

That certainly would increase the difficulty of either feature. However, a solution would be to reconstruct blocks from one torrent when you have two consecutive blocks from the other.

Example: You just downloaded blocks 123 & 124 from torrent A.

Blocks 123 & 124 in torrent A belong to file_a.txt

In torrent B that section (both blocks from A) of the file is divided among blocks 456, 457, & 458.

Therefore, by using offsets, you should be able to reconstruct & hash check block 457 in torrent B, and (more complicatedly) the end of block 456 (so you can stop that block download early). Since you have block 457, the progress of torrent B just advanced without needing to obtain that block from torrent B (although my suggestion would be to treat them as one download and construct files from blocks from both torrents). It may not be 100% efficient, but only downloading from one swarm (current functionality) is more like 50% comparative efficiency.
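The offset arithmetic described above can be sketched roughly as follows. This is only an illustration under made-up assumptions: the `Layout` class, the 16-byte piece length, and the file offsets are hypothetical values chosen so that pieces 123 and 124 of torrent A land on pieces 456, 457, and 458 of torrent B. It is not how BitComet works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of where one shared file sits in a torrent."""
    piece_length: int  # bytes per piece in this torrent
    file_start: int    # offset of the shared file in the torrent's byte stream


def file_range_of_pieces(layout, first_piece, last_piece, file_size):
    """File byte range [start, end) covered by pieces first..last (inclusive)."""
    start = first_piece * layout.piece_length - layout.file_start
    end = (last_piece + 1) * layout.piece_length - layout.file_start
    return max(start, 0), min(end, file_size)


def classify_pieces(layout, start, end):
    """Split the pieces of `layout` that touch file bytes [start, end) into
    pieces that are fully covered and pieces that are only partly covered."""
    full, partial = [], []
    first = (start + layout.file_start) // layout.piece_length
    last = (end - 1 + layout.file_start) // layout.piece_length
    for piece in range(first, last + 1):
        p_start = piece * layout.piece_length - layout.file_start
        p_end = p_start + layout.piece_length
        if start <= p_start and p_end <= end:
            full.append(piece)      # can be hash-checked right away
        else:
            partial.append(piece)   # some bytes are still missing
    return full, partial


# Made-up numbers: both torrents use 16-byte pieces, but the file starts
# at a different offset in each, so the piece boundaries do not line up.
torrent_a = Layout(piece_length=16, file_start=0)
torrent_b = Layout(piece_length=16, file_start=5336)

start, end = file_range_of_pieces(torrent_a, 123, 124, file_size=4096)
full, partial = classify_pieces(torrent_b, start, end)
```

With these numbers, `full` holds piece 457 and `partial` holds pieces 456 and 458, which matches the situation described in the post: only the piece that is completely covered can be hash-checked immediately.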

A possible approach would be to construct the file with the data you get as you get it. Check what you have before requesting blocks.

Example: (X means downloaded, - means not)

File 1:

---------------------------

Hashes (A then B respectively):

|---1---|---2---|---3---|-

---|----6----|----7----|---

Blocks 6 & 7 of B are received.

---XXXXXXXXXXXXXXXXXXXXX---

|---1---|---2---|---3---|-

---|----6----|----7----|---

Block 2 of A is no longer needed, nor is the end of block 1, or block 3 if the first part of block 8 of B is received first. Block priority is also affected by completed percent (block 8 of B is favored over block 3 of A). More complicated than I originally envisioned, but it still seems possible and helpful.
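The "check what you have before requesting blocks" step could be sketched by tracking the downloaded byte ranges of the shared file. The function names and the byte positions below are assumptions made for illustration (8-byte pieces in A, 10-byte pieces in B starting at offset 3, loosely following the diagram); nothing here is taken from BitComet.

```python
def add_interval(have, start, end):
    """Return a sorted list of disjoint [start, end) ranges with the new one merged in."""
    merged = []
    for s, e in sorted(have + [(start, end)]):
        if merged and s <= merged[-1][1]:
            # Overlapping or touching ranges collapse into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def missing_in(have, start, end):
    """Return the parts of [start, end) that `have` does not cover yet.
    `have` must be sorted and disjoint, as produced by add_interval."""
    gaps, cursor = [], start
    for s, e in have:
        if e <= cursor:
            continue
        if s >= end:
            break
        if s > cursor:
            gaps.append((cursor, s))
        cursor = max(cursor, e)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Blocks 6 and 7 of B arrive (bytes 3..13 and 13..23 of the file).
have = add_interval([], 3, 13)
have = add_interval(have, 13, 23)

need_block_1 = missing_in(have, 0, 8)    # block 1 of A
need_block_2 = missing_in(have, 8, 16)   # block 2 of A
need_block_3 = missing_in(have, 16, 24)  # block 3 of A
```

Here `need_block_2` comes back empty, so block 2 of A never has to be requested, while blocks 1 and 3 each still lack a few bytes at one end.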

Const2k · July 3, 2006

Hi there.

(my post seemed to be over-overquoted, so I've decided to use different colors for my old text (blue) and your text (light brown) instead of quotes and comments. Sorry for inconvenience.)

When I've said "We have almost the same idea being discussed nearby" I meant just the same thing you said later: "both features require identical file identification between different torrents". :)

Well, there can be lots of features added to BitTorrent clients & trackers based on the mechanism of identifying single files in different .torrent-s. And that mechanism is what we all are talking about. Now let me be a little more personal :)

However, a solution would be to reconstruct blocks from one torrent when you have two consecutive blocks from the other.

Example: You just downloaded blocks 123 & 124 from torrent A.

Blocks 123 & 124 in torrent A belong to file_a.txt

I assume file_a.txt is divided between only two pieces (let's use BitComet terms), and then it is 100% completed.

In torrent B that section (both blocks from A) of the file is divided among blocks 456, 457, & 458.

Therefore, by using offsets, you should be able to reconstruct & hash check block 457 in torrent B, and (more complicatedly) the end of block 456 (so you can stop that block download early).

Piece #457 must be OK, but #456 can't be verified as of now, as we need its first part to be able to hash-check #456. We have no idea when we'll get this part, so we'll have to store its last part in cache without being able to send it out or use it in any other way.

Since you have block 457 the progress of torrent B just progressed without needing to obtain that block from torrent B (although my suggestion would be to treat them as one download and construct files from blocks from both torrents).

No objections. Just some useless - as of now and indefinitely from now on - data (last part of #456 & first part of #458) in cache.

It may not be 100% efficient, but only downloading from one swarm (current functionality) is more like 50% comparative efficiency.

...assuming both swarms have equal number of seeds & leechers, equal U/D speeds and equal distribution of pieces between peers. Not of much relevance to real life IMHO... Nevermind.

The point is trade-off between CPU & memory usage and overall change in traffic. Each downloaded piece (and there may - and, most likely, will - be hundreds of them) from one torrent has to be compared... with what? Here we are...

How can BC identify identical files in different torrents?

THIS seems to be a problem to solve first. (Well, let's skip it for a while. Let's assume WE tell BitComet that "these two" files are identical. And let's even assume we're right (not too obvious, to be precise...)).

A possible approach would be to construct the file with the data you get as you get it.

And how would we verify that the received data is correct? And how much data, from which torrent's copy of the file, do we drop if there are (and there will be) errors?

Check what you have before requesting blocks.

Every, say, 1 s? Comparing each of the xxx (in)complete pieces of one xxx MB torrent with the yyy (in)complete pieces of another yyy MB one? (I still assume the files within torrents x and y ARE identical.)

Example: (X means downloaded, - means not)

File 1:

---------------------------

Hashes (A then B respectively):

|---1---|---2---|---3---|-

---|----6----|----7----|---

Blocks 6 & 7 of B are received.

---XXXXXXXXXXXXXXXXXXXXX---

|---1---|---2---|---3---|-

---|----6----|----7----|---

Block 2 of A is no longer needed, nor is the end of block 1, or block 3 if the first part of block 8 of B is received first. Block priority is also affected by completed percent (block 8 of B is favored over block 3 of A). More complicated than I originally envisioned, but it still seems possible and helpful.

So here's a question: how much will it cost in terms of resources? What CPU should I have, and how much memory (= what cache) would be sufficient for this to work 24/7? Even if everything above is written into BitComet's code and works flawlessly (they say, BTW, "the simpler, the better (= more reliable)") ...

On the other hand, if this isn't supposed to work 24/7, how often will there be a need to use it? Isn't it possible that this feature will become too "heavy" to be widely used?

It's not useless. It can be used to "repair" some "dead" torrents at the cost of high CPU usage. This will lead to (partially) duplicate "alive" torrents. It can be used to receive files from more peers at the cost of high CPU/RAM/HDD usage.

What should private trackers' users do if one file is in private & public torrent?

It seems to me that number of questions exceeds number of answers... So, my overall rating is "very questionable".

You ask me what I'd suggest myself? You already know it :) Add checksums of single files into .torrent first. And then anyone will be able to post their suggestions in appropriate topic ;)

E.g. it'll be possible to find & download single needed file in all BitTorrent community...
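As a rough illustration of what per-file checksums would make possible, here is a minimal sketch. The `file_sha1` field name and both helper functions are hypothetical; they are assumptions for this example and not part of the .torrent layout discussed in this thread.

```python
import hashlib


def file_entries(files):
    """Build per-file metadata with a hypothetical `file_sha1` field.
    `files` maps a path inside the torrent to that file's bytes."""
    return [
        {
            "path": path,
            "length": len(data),
            "file_sha1": hashlib.sha1(data).hexdigest(),
        }
        for path, data in files.items()
    ]


def shared_files(entries_a, entries_b):
    """Pair up files from two torrents whose length and checksum both match,
    regardless of what the files are called."""
    index = {(e["length"], e["file_sha1"]): e["path"] for e in entries_a}
    pairs = []
    for entry in entries_b:
        key = (entry["length"], entry["file_sha1"])
        if key in index:
            pairs.append((index[key], entry["path"]))
    return pairs


# Toy data standing in for real episode files.
torrent_a = file_entries({"3.avi": b"episode three"})
torrent_b = file_entries({"c.avi": b"episode three", "d.avi": b"episode four"})
matches = shared_files(torrent_a, torrent_b)
```

With such a field in place, `matches` pairs `3.avi` with `c.avi` without downloading anything, which is exactly the identification step the piece-based hashes cannot provide.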

Any more ideas? I'm just wondering whether programmers can hear their users...

izomiac · July 4, 2006

It looks like my explanation leaves a bit to be desired... Rather than address all your questions with a massive quote/response format, I'll try to address the questions and explain the exact procedure a little better.

It's not useless. It can be used to "repair" some "dead" torrents at the cost of high CPU usage. This will lead to (partially) duplicate "alive" torrents. It can be used to receive files from more peers at the cost of high CPU/RAM/HDD usage.

Other than "high" ("slightly higher" IMHO) I completely agree with this summary. In any case where you can increase the number of peers/seeders, your download speed should improve (as should torrent health in both swarms). Honestly reporting upload amounts to each tracker might be tricky though.

It seems to me that number of questions exceeds number of answers... So, my overall rating is "very questionable".

Fair enough. If it were just this feature that required so much work then I wouldn't even give it that. (When I thought of it I was under the impression that .torrents hashed each file separately.) But since other features seem like they could also use the core functionality, perhaps it's still worth discussion.

For the private/public tracker situation... Personally I'd say just let anyone with access to the public tracker use this feature (i.e. ignore the private flag). But, since private tracker operators seem to be a bit obsessed with control (IMHO), I suppose a safer solution would be to not allow this feature for torrents with the private flag.

Checksums of individual files would be a nice feature, and probably easy to implement. But if BitComet were the only client that did so, the feature would be almost useless IMHO. After all, it's only a useful feature if the .torrent file has it, and AFAIK most .torrents aren't made with BitComet. While a simple and elegant solution, it doesn't seem practical... :-/

While computing hashes may be CPU intensive, I mostly consider it "free". The reason is that with my 3.06 GHz P4 I can compute somewhere around 200,000 hashes a second (most types). Now, if you have a piece that's larger than the 8 or so bytes I test (crypto brute forcing), it'll probably take longer than 1/200,000th of a second. Still though, if the piece is still in RAM it shouldn't consume a noticeable chunk of CPU power. Computing two or three times as many hashes in a download shouldn't increase BitComet's CPU requirements.
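The cost of hashing whole pieces is easy to measure rather than guess. The sketch below times SHA-1 (the hash BitTorrent uses for pieces) over an in-memory buffer; the 256 KiB piece size and the number of rounds are assumed values for illustration, and the result will vary a lot from machine to machine.

```python
import hashlib
import time


def sha1_throughput(piece_size=256 * 1024, rounds=200):
    """Hash a zero-filled buffer of `piece_size` bytes `rounds` times and
    return the number of pieces hashed per second."""
    piece = bytes(piece_size)
    started = time.perf_counter()
    for _ in range(rounds):
        hashlib.sha1(piece).digest()
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - started, 1e-9)
    return rounds / elapsed


if __name__ == "__main__":
    rate = sha1_throughput()
    print(f"{rate:.0f} pieces of 256 KiB hashed per second")
```

Comparing that rate with the number of pieces a connection can actually deliver per second gives a concrete idea of how much headroom there is for the extra hash checks.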

RAM usage & Harddisk speed are other factors to consider. But with the method I describe there shouldn't need to be any more data stored in RAM (just move it to the harddisk when there's a chance, since the data is a piece from the other torrent). The extra harddisk activity generated by the overhead of such a feature would still be far less than downloading two torrents, which is what my feature idea is (essentially). If the feature isn't needed then it isn't used (when two torrents are added with files with the same extension and size then ask if they are the same, if so then activate it). Identifying if files are identical without downloading them is going to be near (if not actually) impossible for the reasons you mention in the other thread. Network speed is much slower than about anything else in most cases, so sacrificing a little CPU time to increase download speed should still make the torrent complete more quickly (more peers/seeders should help a download in most situations).
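The "same extension and size, then ask the user" check could look something like this. It is only a sketch: the function name, the input format, and the byte sizes are made up, and a match is merely a candidate that the user still has to confirm.

```python
import os
from collections import defaultdict


def candidate_matches(torrents):
    """Group files that share an extension and a size across different torrents.
    `torrents` maps a torrent name to a list of (filename, size_in_bytes)."""
    groups = defaultdict(list)
    for torrent, files in torrents.items():
        for name, size in files:
            ext = os.path.splitext(name)[1].lower()
            groups[(ext, size)].append((torrent, name))
    # Keep only groups that span at least two different torrents.
    return {
        key: members
        for key, members in groups.items()
        if len({torrent for torrent, _ in members}) > 1
    }


# Made-up sizes for the files from the first post.
candidates = candidate_matches({
    "A": [("2.avi", 650), ("3.avi", 700)],
    "B": [("c.avi", 700)],
    "C": [("3.avi", 700), ("4.avi", 710)],
})
```

File names are deliberately ignored, so `3.avi` in A and C and `c.avi` in B end up in one group, while two unrelated files that merely share a name but differ in size would not.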

Here's a more descriptive version of what I'm picturing since I don't think it will use too much additional CPU/RAM/HD Activity...

A section of file.ext: (this is somewhere in the middle of the file)

------------------------------ (30 bytes of placeholders or space that hasn't been written)

For the sake of simplicity, assume this section is divided into a whole number of complete pieces in both torrents.

Torrent A has this section evenly divided into pieces 566, 567, and 568 (10 bytes each).

Torrent B has it divided into pieces 1001 and 1002 (15 bytes each).

Piece 567 of torrent A has arrived.

Piece 567 of A is verified using A's hash table.

New data section:

----------XXXXXXXXXX---------- (30 bytes, the middle 10 are now known)

This is written to file.ext on the harddisk or stored in the disk cache (normal procedure).

When a new piece is requested, pieces 1001 and 1002 of B are a lower priority than pieces 566 or 568 of A because they are partially known (and downloading them in their entirety would be somewhat unnecessary).

Said priority check occurs when a new piece can be requested from torrent B. BitComet would have 3 piece states instead of two. (I have it, I need it, and I sorta need it instead of just the first two options).

Due to random chance, piece 1002 is received first and is verified by torrent B's hash table.

New data section:

----------XXXXXXXXXXXXXXXXXXXX (30 bytes, the last 20 are known)

BitComet later can request a piece from torrent A (some piece outside of this section is downloaded or whatever).

The status of piece 568 is checked with A's hash table.

If it passes, then piece 568 is considered done and can be shared.

If it fails then BitComet displays a "you were wrong" type error and either stops the download or copies what it has to a new file and writes this new piece (each torrent now has its own).
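The walkthrough above can be played out in a small simulation. Everything in it is an assumption made for illustration: the `SharedSection` class, the state names, and the 30 stand-in bytes are hypothetical, and only the piece numbers and sizes come from the example.

```python
import hashlib

SECTION = bytes(range(30))  # stand-in for the real 30 bytes of file.ext
PIECES = {  # (torrent, piece) -> (start, end) within the section
    ("A", 566): (0, 10), ("A", 567): (10, 20), ("A", 568): (20, 30),
    ("B", 1001): (0, 15), ("B", 1002): (15, 30),
}
# Each torrent's hash table, built here from the stand-in data.
HASHES = {k: hashlib.sha1(SECTION[s:e]).digest() for k, (s, e) in PIECES.items()}


class SharedSection:
    def __init__(self, size):
        self.data = bytearray(size)
        self.known = [False] * size

    def state(self, key):
        """'have' (all bytes present), 'need', or 'sorta need' (partly known)."""
        s, e = PIECES[key]
        count = sum(self.known[s:e])
        if count == e - s:
            return "have"  # still run cross_verify before sharing it
        return "sorta need" if count else "need"

    def receive(self, key, payload):
        """Verify a downloaded piece against its own torrent's hash, then store it."""
        if hashlib.sha1(payload).digest() != HASHES[key]:
            return False
        s, e = PIECES[key]
        self.data[s:e] = payload
        self.known[s:e] = [True] * (e - s)
        return True

    def cross_verify(self, key):
        """Check a piece assembled from the other torrent's data against this
        torrent's hash. False means incomplete, or the files are not identical."""
        s, e = PIECES[key]
        if not all(self.known[s:e]):
            return False
        return hashlib.sha1(bytes(self.data[s:e])).digest() == HASHES[key]


section = SharedSection(30)
section.receive(("A", 567), SECTION[10:20])          # piece 567 of A arrives
states_after_567 = {k: section.state(k) for k in PIECES}
section.receive(("B", 1002), SECTION[15:30])         # piece 1002 of B arrives
piece_568_ok = section.cross_verify(("A", 568))      # check against A's hash
```

After piece 567 arrives, both pieces of B are in the "sorta need" state while 566 and 568 are plain "need"; after piece 1002 arrives, piece 568 of A passes A's hash check without ever being requested from A. If the two files were not actually identical, `cross_verify` would return False, which is the "you were wrong" case.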

I'm not sure of BitComet's exact operating method, but this seems like it's possible (and not too resource intensive). This would only be used if two tasks seem to have some of the same files and the user identifies the files as identical when prompted. This would be useful for less popular torrents where peers/seeders are divided among different torrents with identical files. I know it would be useful for fansubs and other episodic video files. I would imagine that it could be used with situations like the original poster in the other thread implied...

jessie · August 1, 2006

Suppose that we have implemented this function to download the identical files 3.avi from Torrent A and 3.avi from Torrent C to a single location, without redundantly requesting blocks from both swarms. Here is a new question: every file named 3.avi would be saved into one file by this function, even if two different files both happen to be named 3.avi (for example, one from movieA and one from movieB). So when we add this function, we should also add a way to distinguish different files (movieA and movieB). It's hard to do, but we will take it into account.

Thanks! :)
