[00:05] *** julientm has quit IRC (Remote host closed the connection)
[00:06] *** julientm has joined #archiveteam-ot
[00:08] *** julientm has quit IRC (Remote host closed the connection)
[00:11] *** julientm has joined #archiveteam-ot
[00:29] *** julientm has quit IRC (Remote host closed the connection)
[00:32] *** julientm has joined #archiveteam-ot
[01:51] *** killsushi has joined #archiveteam-ot
[02:06] *** killsushi has quit IRC (Quit: Leaving)
[02:34] *** julientm has quit IRC (Read error: Operation timed out)
[02:44] *** SimpBrain has quit IRC (Remote host closed the connection)
[02:45] *** SimpBrain has joined #archiveteam-ot
[02:56] *** m007a83 has quit IRC (Read error: Connection reset by peer)
[03:00] *** BlueMax has joined #archiveteam-ot
[03:01] *** m007a83 has joined #archiveteam-ot
[03:42] *** wp494 has quit IRC (Read error: Operation timed out)
[03:44] *** wp494 has joined #archiveteam-ot
[03:56] *** VADemon has quit IRC (Quit: left4dead)
[04:07] *** Despatche has joined #archiveteam-ot
[04:45] *** deevious1 has joined #archiveteam-ot
[04:46] *** deevious has quit IRC (Ping timeout: 252 seconds)
[04:46] *** deevious1 is now known as deevious
[05:00] *** odemg has quit IRC (Ping timeout: 615 seconds)
[05:07] *** odemg has joined #archiveteam-ot
[05:16] *** m007a83 has quit IRC (Quit: Fuck you Comcast)
[05:16] *** m007a83 has joined #archiveteam-ot
[06:16] *** nataraj has joined #archiveteam-ot
[06:28] *** m007a83 has quit IRC (Read error: Connection reset by peer)
[06:30] *** m007a83 has joined #archiveteam-ot
[06:39] *** Despatche has quit IRC (Remote host closed the connection)
[06:41] *** Despatche has joined #archiveteam-ot
[06:42] *** Despatche has quit IRC (Read error: Connection reset by peer)
[06:42] *** Despatche has joined #archiveteam-ot
[06:50] *** Despatche has quit IRC (Connection reset by peer)
[06:53] *** deevious1 has joined #archiveteam-ot
[06:55] *** deevious has quit IRC (Ping timeout: 252 seconds)
[06:55] *** deevious1 is now known as deevious
[07:45] *** Stiletto has joined #archiveteam-ot
[07:45] *** Stilett0 has quit IRC (Ping timeout: 252 seconds)
[08:20] *** deevious has quit IRC (Ping timeout: 252 seconds)
[08:58] *** deevious has joined #archiveteam-ot
[09:03] *** nataraj has quit IRC (Read error: Operation timed out)
[09:21] *** nataraj has joined #archiveteam-ot
[09:28] *** nataraj has quit IRC (Read error: Connection reset by peer)
[09:29] *** nataraj has joined #archiveteam-ot
[09:47] *** SimpBrain has quit IRC (Read error: Operation timed out)
[09:55] *** SimpBrain has joined #archiveteam-ot
[09:58] *** S1mpbrain has joined #archiveteam-ot
[09:58] *** SimpBrain has quit IRC (Read error: Connection reset by peer)
[10:04] *** nataraj has quit IRC (Ping timeout: 268 seconds)
[11:32] *** BlueMax has quit IRC (Quit: Leaving)
[12:28] *** nataraj has joined #archiveteam-ot
[12:31] *** nataraj has quit IRC (Read error: Operation timed out)
[12:32] *** kiska1 has quit IRC (Read error: Operation timed out)
[12:35] *** kiska1 has joined #archiveteam-ot
[12:41] *** wp494 has quit IRC (Read error: Operation timed out)
[12:42] *** wp494 has joined #archiveteam-ot
[13:29] *** VADemon has joined #archiveteam-ot
[13:36] *** icedice has joined #archiveteam-ot
[14:05] *** S1mpbrain has quit IRC (Read error: Connection reset by peer)
[14:12] *** SimpBrain has joined #archiveteam-ot
[14:16] *** deevious has quit IRC (Quit: deevious)
[14:54] *** deevious has joined #archiveteam-ot
[14:54] *** LFlare has quit IRC (Ping timeout: 268 seconds)
[15:03] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[15:04] *** Stiletto has joined #archiveteam-ot
[15:04] *** deevious has quit IRC (Quit: deevious)
[15:10] *** nataraj has joined #archiveteam-ot
[15:23] Well, that didn't last long... transfer.sh's down again already.
[15:37] *** deevious has joined #archiveteam-ot
[16:05] *** deevious has quit IRC (Remote host closed the connection)
[17:42] *** julientm has joined #archiveteam-ot
[18:24] *** lindalap has joined #archiveteam-ot
[18:25] Reddit just banned r/watchpeopledie https://old.reddit.com/r/subredditcancer/comments/b1hic4/reddit_just_banned_rwatchpeopledie/
[18:25] Also r/wpdtalk
[18:26] I believe there was a grab some months ago, so it's partially archived at least (up until March 2018, I think)
[18:30] And yeah, reddit is doing its censorship as a PR move. I've not seen glorification of violence on that subreddit since March 2018, even a few hours ago.
[18:34] I wonder how much of it is also in the Pushshift dataset, since it got quarantined a while ago.
[18:36] The video from Christchurch was posted there and then removed by the Reddit admins, by the way.
[18:36] (Yes, admins, not the sub's mods.)
[18:40] IIRC, the subreddit was banned for the same reason ("glorification") in March 2018, then subsequently unbanned after the reddit community petitioned, arguing they found it to be an educational place about the state between life and death. Stuff like some depressed user not wanting to commit suicide anymore, or a firefighter finding the subreddit useful in their job.
[18:41] Yeah, that sounds right, although I don't remember it.
[18:42] Ah yeah, that was due to a Vice article, right.
[18:44] Might've been, I don't remember exactly.
[18:44] Hmm, doesn't look good regarding Pushshift: http://redditsearch.io/?term=&dataviz=false&aggs=false&subreddits=watchpeopledie&searchtype=posts&search=true&start=1549997068&end=1552675468&size=100
[18:44] Unless it got blocked there already.
[18:45] Last results there are from around half a year ago. That might be when it got quarantined.
[18:46] Can ArchiveBot handle quarantines, like the 18+ warning?
[18:46] It seems I don't have to log in to view quarantined subreddits
[18:46] I've read that quarantined subreddits used to require being logged in to read
[18:48] No, it needs some cookie, but ArchiveBot doesn't set it. So no, it can't grab quarantined subs.
[18:48] The cookie needed is this (or at least it was about half a year ago): _options=%7B%22pref_quarantine_optin%22%3A%20true%7D
[18:49] r/gore banned too, huh
[18:49] 1 hour ago
[18:49] * lindalap tried to find a quarantined subreddit to check what the cookie is
[18:51] Seems simple enough. Cookie name "_options" with value "%7B%22pref_quarantine_optin%22%3A%20true%7D".
[18:51] We should collect a list of subs so we can check the status of all of them when Reddit does something like this.
[18:51] So that didn't change since October. :-)
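For reference, a minimal sketch of a request using the quarantine opt-in cookie quoted at 18:48, assuming that value is still accepted; the subreddit path and User-Agent are placeholders, and `requests` stands in for whatever a crawler would actually use:

```python
# Minimal sketch: fetch a quarantined subreddit with the opt-in cookie
# from the log. URL-decoded, the cookie value is the JSON string
# {"pref_quarantine_optin": true}.
import requests

cookies = {"_options": "%7B%22pref_quarantine_optin%22%3A%20true%7D"}

resp = requests.get(
    "https://old.reddit.com/r/some_quarantined_sub/",  # hypothetical sub
    cookies=cookies,
    headers={"User-Agent": "archiveteam-cookie-test/0.1"},  # placeholder UA
    timeout=30,
)
# Per the discussion above: without the cookie you get the quarantine
# interstitial instead of the subreddit content.
print(resp.status_code)
```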
[19:27] *** Despatche has joined #archiveteam-ot
[19:45] *** Stilett0 has joined #archiveteam-ot
[19:47] *** Stiletto has quit IRC (Read error: Operation timed out)
[20:28] *** nataraj has quit IRC (Quit: Konversation terminated!)
[20:28] *** nataraj has joined #archiveteam-ot
[20:45] *** BlueMax has joined #archiveteam-ot
[21:01] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[21:31] *** nataraj has quit IRC (Read error: Operation timed out)
[21:36] *** icedice2 has joined #archiveteam-ot
[21:41] *** icedice has quit IRC (Read error: Operation timed out)
[21:45] *** wp494 has quit IRC (Ping timeout: 615 seconds)
[21:45] *** wp494 has joined #archiveteam-ot
[22:48] *** dw_ has joined #archiveteam-ot
[22:48] hey. i've been making a new backup of thepiratebay. what is the best way to share it?
[22:50] the old ones are out of date and the torrent description fields had incorrect encoding. ideally want to make a big torrent of the raw html in case my parser is also broken somehow
[22:51] Doesn't TPB provide dumps somewhere? I think it's hidden pretty well though.
[22:53] they have truncated titles and most metadata missing
[22:53] no description/comments fields
[22:53] Ah, I see.
[22:54] dw_: please get it as WARCs!!
[22:54] WARCs are awesome <3
[22:54] That, and how large is it?
[22:54] arkiver: its up to 60gb of gzipped html. i started this for fun, so its literally just a xargs+wget job :P
[22:55] i can probably convert to warc but headers and the majority of timestamps are gone
[22:55] Yeah, don't do that.
[22:55] plan was to recompress it into quarterly 7zips, solid compression helps a /ton/
[22:55] i think solid zips it'll be around 8gb total
[23:01] dw_: Can you share some details on how you did the retrieval? Anything special, or just https://thepiratebay.org/torrent/$id/ iterating over all IDs?
[23:02] Ah, there are sitemaps also. I wonder how complete those are.
[23:03] Nevermind, the torrent sitemaps are 404s.
[23:05] Also, for the record, the dumps I was referring to before are these, I think: https://thepiratebay.org/static/dump/
[23:05] yep, those dumps are woefully incomplete :) useful for cross-checking, but i haven't got that far yet
[23:06] initial scrape used urls from archive.org, then some random tpb dump i found, then the most recent 'full metadata' backup i could find (dec 2017)
[23:06] now it is brute forcing backwards from the current high id, very slowly
[23:07] also want to combine this dump with the last backup and possibly tpb's own static dumps somehow, and add a 'deleted_by' field, showing when a torrent was removed
[23:07] found tens of thousands removed since the last backup already, i think most of it porn
[23:08] tpb ids are spaced anywhere between 1 and 1500 apart. no clue what they're doing, it looks like an anti-scraping measure, i doubt they get that much spam
[23:08] I wouldn't be surprised at all if 99 % of the uploads were spam.
[23:08] You mentioned "very slowly": did you encounter any rate limits or similar?
[23:09] yeah, we're talking about id jumps of 50 or 100 within seconds, it seems..
[23:09] i've been bumping concurrency up gently over time, currently its at 25, feels like it could take a lot higher but that might piss off cloudflare or tpb
[23:10] Any idea how many requests per second that translates into?
[23:10] its wget :) so ssl session renegotiation on every connection, too late to fix
[23:11] maybe getting ~20-25/sec
[23:11] 1.3MiB/sec of mostly 404s
[23:11] So we're talking ~14 days of scanning.
[23:11] thats almost exactly right
[23:12] Does TPB use Cloudflare? Doesn't look like it from the TLS certificate.
[23:12] ip addresses are def cloudflare
[23:13] Yeah, true.
[23:13] They probably gave their COMODO cert to Cloudflare then.
[23:14] Or cert keys, to be precise.
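The ~14-day figure checks out against the numbers in the log: at ~25 requests/second, covering an ID space up to the ~30 million range (a torrent ID near 30M appears later in the log) is roughly 30,000,000 / 25 ≈ 1.2 million seconds, about 14 days. Below is a minimal sketch of that kind of ID iteration; this is not dw_'s actual setup (which was xargs+wget), and the output layout is an assumption. Concurrency, retries, and politeness are left out:

```python
# Sketch: fetch https://thepiratebay.org/torrent/$id/ for a candidate ID
# and keep anything that isn't a 404 (most IDs are gaps, per the
# "1.3MiB/sec of mostly 404s" remark above).
import gzip
import pathlib
import requests

# One reused session avoids the per-URL SSL renegotiation dw_ mentions
# as a drawback of spawning a fresh wget per request.
session = requests.Session()

def fetch_torrent_page(tid: int, outdir: pathlib.Path) -> bool:
    """Save the gzipped HTML for one torrent ID; return False on 404."""
    resp = session.get(f"https://thepiratebay.org/torrent/{tid}/", timeout=30)
    if resp.status_code == 404:
        return False
    resp.raise_for_status()
    (outdir / f"{tid}.html.gz").write_bytes(gzip.compress(resp.content))
    return True
```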
[23:14] im really surprised it's possible to crawl so fast, to be honest
[23:14] maybe they're on some special config, its probably one of cloudflare's biggest sites too
[23:15] *** Despatche has quit IRC (Remote host closed the connection)
[23:15] One semi-annoying thing about retrieving https://thepiratebay.org/torrent/$id/ is that external links will be broken, since they always have a name/slug after the ID.
[23:15] Works fine for what you're doing, but I'm thinking about WARCs at the moment.
[23:15] i think its in link rel="canonical"
[23:15] *** Despatche has joined #archiveteam-ot
[23:16] Nope
[23:16] bah, nope :/
[23:16] can probably figure out the slug algorithm easily enough :P
[23:16] Possibly. Spaces and + get replaced by underscores.
[23:17] html entities escaped, then ampersands stripped too by the looks of it
[23:17] really easy to cross-check this once the dump is complete..
[23:17] pull out examples of every char in a title, and the longest titles, search for them, then compare
[23:17] Another option would be to go through the uploader's page.
[23:18] its limited to 30ish pages, and only for recent results. their search by author cuts off at some point in history
[23:18] Ah right.
[23:18] Same as "recent torrents" then.
[23:18] Meh
[23:19] And categories and probably everything.
[23:19] enumeration options are slim
[23:20] archive.org's CDX server or whatever it's called had hardly any /torrent/ urls
[23:20] doesn't seem to be any way to go from their public dump to a torrent without searching every single one of them, or some weird thing where you combine many approaches like category/author search, but it doesn't seem likely to be as complete as just brute force
[23:21] the only thing i didnt check was common crawl, it should be possible to extract urls from it too, but again, completeness
[23:22] Ah, comment pagination happens through JS also. Meh
[23:23] i also want to brute force all the missing ids from the last backup
[23:23] or at least cross-check that backup against what their official dump claims
[23:23] their id scheme is sufficiently weird that a scrape could easily miss stuff
[23:24] *** ColdIce has joined #archiveteam-ot
[23:25] I can see a recursive crawl being incomplete, but a full ID bruteforce should be fine, no?
[23:26] brute force so far is just from max(last backup)..max(top torrent on recent torrents page)
[23:27] By the way, the ampersand of HTML entities isn't simply stripped in URLs but replaced with an underscore. Same for the semicolon. So "Väter" becomes "V_auml_ter".
[23:27] Doing this reliably will be very tricky.
[23:27] yeah, looks like something like do_underscore_replace(htmlescapechars($title)) or something in php
[23:28] pretty sure its truncated too
[23:28] It might be possible to search for the title, but that probably fails in some situations as well.
[23:28] 5 million search requests :( they'd almost certainly ban :P
[23:29] A ban has never kept us away.
[23:29] :-)
[23:29] And yes, titles are truncated, but not only in URLs.
[23:29] hehe
[23:30] Random example from the top list: https://thepiratebay.org/torrent/29943379
[23:32] i wondered about efficient approaches to scanning their id space
[23:32] And we arrived at "let's make millions of search queries". :-)
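The slug guesses above (entity-escape, then underscores for spaces, +, & and ;, with "Väter" becoming "V_auml_ter") can be turned into a testable sketch. This is a conjecture reconstructed only from those observations, not TPB's confirmed algorithm: the catch-all "replace every non-alphanumeric character" rule is an assumption beyond the characters actually seen, and the title truncation mentioned at 23:28 is not handled.

```python
# Conjectured TPB URL-slug algorithm, reverse-engineered from the single
# observed example "Väter" -> "V_auml_ter" and the characters discussed
# in the log. Roughly do_underscore_replace(htmlentities($title)) in PHP
# terms.
import html.entities
import re

def tpb_slug(title: str) -> str:
    # Step 1: PHP-htmlentities-style encoding of non-ASCII characters,
    # preferring named entities where one exists.
    out = []
    for ch in title:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        else:
            name = html.entities.codepoint2name.get(cp)
            out.append(f"&{name};" if name else f"&#{cp};")
    # Step 2: replace non-alphanumerics with underscores. The log only
    # confirms space, +, & and ;; treating all other punctuation the same
    # way is an unverified assumption.
    return re.sub(r"[^A-Za-z0-9]", "_", "".join(out))

assert tpb_slug("Väter") == "V_auml_ter"
```

Cross-checking this against the finished dump, as suggested at 23:17, would be the way to confirm or refute the catch-all rule.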
[23:32] there have clearly been manual sequence resets at some point, e.g. there is torrent 9999999
[23:33] hoping to draw a graph of torrent count by week and look for any obvious dips (assuming they're discernible from downtimes), that might help spot missing id ranges
[23:33] Yeah, I also remember something about them bumping the IDs at some point to make it seem like there were many more torrents, or something. Might've been a different site though.
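The count-by-week idea at 23:33 is simple to sketch. In the snippet below, `load_torrents()` is a hypothetical helper yielding `(id, upload_datetime)` pairs from the completed scrape; sudden dips in the printed counts would flag candidate missing-ID ranges (or site downtime).

```python
# Sketch: bucket scraped torrents by ISO week of upload and print the
# counts, so dips that might indicate missed ID ranges stand out.
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

def weekly_counts(torrents: Iterable[Tuple[int, datetime]]) -> Counter:
    counts: Counter = Counter()
    for _tid, uploaded in torrents:
        year, week, _weekday = uploaded.isocalendar()
        counts[(year, week)] += 1
    return counts

# Usage, assuming the hypothetical load_torrents() exists:
# for (year, week), n in sorted(weekly_counts(load_torrents()).items()):
#     print(f"{year}-W{week:02d}: {n}")
```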