#archiveteam-ot 2019-03-15,Fri

Time Nickname Message
00:05 julientm has quit IRC (Remote host closed the connection)
00:06 julientm has joined #archiveteam-ot
00:08 julientm has quit IRC (Remote host closed the connection)
00:11 julientm has joined #archiveteam-ot
00:29 julientm has quit IRC (Remote host closed the connection)
00:32 julientm has joined #archiveteam-ot
01:51 killsushi has joined #archiveteam-ot
02:06 killsushi has quit IRC (Quit: Leaving)
02:34 julientm has quit IRC (Read error: Operation timed out)
02:44 SimpBrain has quit IRC (Remote host closed the connection)
02:45 SimpBrain has joined #archiveteam-ot
02:56 m007a83 has quit IRC (Read error: Connection reset by peer)
03:00 BlueMax has joined #archiveteam-ot
03:01 m007a83 has joined #archiveteam-ot
03:42 wp494 has quit IRC (Read error: Operation timed out)
03:44 wp494 has joined #archiveteam-ot
03:56 VADemon has quit IRC (Quit: left4dead)
04:07 Despatche has joined #archiveteam-ot
04:45 deevious1 has joined #archiveteam-ot
04:46 deevious has quit IRC (Ping timeout: 252 seconds)
04:46 deevious1 is now known as deevious
05:00 odemg has quit IRC (Ping timeout: 615 seconds)
05:07 odemg has joined #archiveteam-ot
05:16 m007a83 has quit IRC (Quit: Fuck you Comcast)
05:16 m007a83 has joined #archiveteam-ot
06:16 nataraj has joined #archiveteam-ot
06:28 m007a83 has quit IRC (Read error: Connection reset by peer)
06:30 m007a83 has joined #archiveteam-ot
06:39 Despatche has quit IRC (Remote host closed the connection)
06:41 Despatche has joined #archiveteam-ot
06:42 Despatche has quit IRC (Read error: Connection reset by peer)
06:42 Despatche has joined #archiveteam-ot
06:50 Despatche has quit IRC (Connection reset by deer)
06:53 deevious1 has joined #archiveteam-ot
06:55 deevious has quit IRC (Ping timeout: 252 seconds)
06:55 deevious1 is now known as deevious
07:45 Stiletto has joined #archiveteam-ot
07:45 Stilett0 has quit IRC (Ping timeout: 252 seconds)
08:20 deevious has quit IRC (Ping timeout: 252 seconds)
08:58 deevious has joined #archiveteam-ot
09:03 nataraj has quit IRC (Read error: Operation timed out)
09:21 nataraj has joined #archiveteam-ot
09:28 nataraj has quit IRC (Read error: Connection reset by peer)
09:29 nataraj has joined #archiveteam-ot
09:47 SimpBrain has quit IRC (Read error: Operation timed out)
09:55 SimpBrain has joined #archiveteam-ot
09:58 S1mpbrain has joined #archiveteam-ot
09:58 SimpBrain has quit IRC (Read error: Connection reset by peer)
10:04 nataraj has quit IRC (Ping timeout: 268 seconds)
11:32 BlueMax has quit IRC (Quit: Leaving)
12:28 nataraj has joined #archiveteam-ot
12:31 nataraj has quit IRC (Read error: Operation timed out)
12:32 kiska1 has quit IRC (Read error: Operation timed out)
12:35 kiska1 has joined #archiveteam-ot
12:41 wp494 has quit IRC (Read error: Operation timed out)
12:42 wp494 has joined #archiveteam-ot
13:29 VADemon has joined #archiveteam-ot
13:36 icedice has joined #archiveteam-ot
14:05 S1mpbrain has quit IRC (Read error: Connection reset by peer)
14:12 SimpBrain has joined #archiveteam-ot
14:16 deevious has quit IRC (Quit: deevious)
14:54 deevious has joined #archiveteam-ot
14:54 LFlare has quit IRC (Ping timeout: 268 seconds)
15:03 Stiletto has quit IRC (Ping timeout: 246 seconds)
15:04 Stiletto has joined #archiveteam-ot
15:04 deevious has quit IRC (Quit: deevious)
15:10 nataraj has joined #archiveteam-ot
15:23 JAA Well, that didn't last long... transfer.sh's down again already.
15:37 deevious has joined #archiveteam-ot
16:05 deevious has quit IRC (Remote host closed the connection)
17:42 julientm has joined #archiveteam-ot
18:24 lindalap has joined #archiveteam-ot
18:25 lindalap Reddit just banned r/watchpeopledie https://old.reddit.com/r/subredditcancer/comments/b1hic4/reddit_just_banned_rwatchpeopledie/
18:25 lindalap Also r/wpdtalk
18:26 lindalap I believe there was a grab some months ago, so it's partially archived at least (up until March 2018 at least, I think)
18:30 lindalap And yeah, reddit is doing its censorship as a PR move. I've not seen glorification of violence since March 2018 on that subreddit, even a few hours ago.
18:34 JAA I wonder how much of it is also in the Pushshift dataset since it got quarantined a while ago.
18:36 JAA The video from Christchurch was posted there and then removed by the Reddit admins, by the way.
18:36 JAA (Yes, admins, not the sub's mods.)
18:40 lindalap IIRC, the subreddit was banned for the same reason ("glorification") in March 2018, then subsequently unbanned after the reddit community petitioned they found it to be an educational place for the state between life and death. Stuff like some depressed user not wanting to commit suicide anymore, or a firefighter finding the subreddit useful in their job.
18:41 JAA Yeah, that sounds right, although I don't remember it.
18:42 JAA Ah yeah, that was due to a Vice article, right.
18:44 lindalap Might've been, I don't remember exactly.
18:44 JAA Hmm, doesn't look good regarding Pushshift: http://redditsearch.io/?term=&dataviz=false&aggs=false&subreddits=watchpeopledie&searchtype=posts&search=true&start=1549997068&end=1552675468&size=100
18:44 JAA Unless it got blocked there already.
18:45 JAA Last results there are from around half a year ago. That might be when it got quarantined.
18:46 lindalap Can ArchiveBot handle quarantines, like the 18+ warning?
18:46 lindalap It seems I don't have to login to view quarantined subreddits
18:46 lindalap I've read it used to be that quarantined subreddits needed to be logged in to read
18:48 JAA No, it needs some cookie, but ArchiveBot doesn't set it. So no, it can't grab quarantined subs.
18:48 JAA The cookie needed is this (or at least it was about half a year ago): _options=%7B%22pref_quarantine_optin%22%3A%20true%7D
18:49 lindalap r/gore banned too, huh
18:49 lindalap 1 hour ago
18:49 * lindalap tried to find a quarantined subreddit to check what the cookie is
18:51 lindalap Seems simple enough. Cookie name "_options" with value "%7B%22pref_quarantine_optin%22%3A%20true%7D".
18:51 JAA We should collect a list of subs so we can check for the status of all when Reddit does something like this.
18:51 JAA So that didn't change since October. :-)
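[Editor's sketch] The quarantine opt-in cookie quoted above is just percent-encoded JSON. A minimal Python sketch, assuming only the cookie name and value given in the log, shows what it decodes to and how it might be attached to a request. The helper names and the `r/example` URL are hypothetical, and nothing is actually sent over the network here.

```python
import json
from urllib.parse import unquote
from urllib.request import Request

# Cookie name and value exactly as quoted in the log.
COOKIE_NAME = "_options"
COOKIE_VALUE = "%7B%22pref_quarantine_optin%22%3A%20true%7D"


def decode_cookie_value(value):
    """Percent-decode the cookie value and parse the JSON inside it."""
    return json.loads(unquote(value))


def build_request(url):
    """Build (but do not send) a request carrying the opt-in cookie."""
    return Request(url, headers={"Cookie": f"{COOKIE_NAME}={COOKIE_VALUE}"})


if __name__ == "__main__":
    print(decode_cookie_value(COOKIE_VALUE))
    # Hypothetical subreddit URL, for illustration only.
    req = build_request("https://old.reddit.com/r/example/")
    print(req.get_header("Cookie"))
```

Whether Reddit still honours this cookie is not something the sketch can confirm; the log only says it worked between roughly October 2018 and this date.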
19:27 Despatche has joined #archiveteam-ot
19:45 Stilett0 has joined #archiveteam-ot
19:47 Stiletto has quit IRC (Read error: Operation timed out)
20:28 nataraj has quit IRC (Quit: Konversation terminated!)
20:28 nataraj has joined #archiveteam-ot
20:45 BlueMax has joined #archiveteam-ot
21:01 m007a83 has quit IRC (Ping timeout: 252 seconds)
21:31 nataraj has quit IRC (Read error: Operation timed out)
21:36 icedice2 has joined #archiveteam-ot
21:41 icedice has quit IRC (Read error: Operation timed out)
21:45 wp494 has quit IRC (Ping timeout: 615 seconds)
21:45 wp494 has joined #archiveteam-ot
22:48 dw_ has joined #archiveteam-ot
22:48 dw_ hey. i've been making a new backup of thepiratebay. what is the best way to share it?
22:50 dw_ the old ones are out of date and the torrent description fields had incorrect encoding. ideally want to make a big torrent of the raw html in case my parser is also broken somehow
22:51 JAA Doesn't TPB provide dumps somewhere? I think it's hidden pretty well though.
22:53 dw_ they have truncated title and most metadata missing
22:53 dw_ no description/comments fields
22:53 JAA Ah, I see.
22:54 arkiver dw_: please get it as WARCs!!
22:54 arkiver WARCs are awesome <3
22:54 JAA That, and how large is it?
22:54 dw_ arkiver: its up to 60gb of gzipped html. i started this for fun, so its literally just a xargs+wget job :P
22:55 dw_ i can probably convert to warc but headers and majority of timestamps are gone
22:55 JAA Yeah, don't do that.
22:55 dw_ plan was to recompress it into quarterly 7zips, solid compression helps a /ton/
22:55 dw_ i think solid zips it'll be around 8gb total
23:01 JAA dw_: Can you share some details on how you did the retrieval? Anything special, or just https://thepiratebay.org/torrent/$id/ iterating over all IDs?
23:02 JAA Ah, there are sitemaps also. I wonder how complete those are.
23:03 JAA Nevermind, the torrent sitemaps are 404s.
23:05 JAA Also, for the record, the dumps I was referring to before are these I think: https://thepiratebay.org/static/dump/
23:05 dw_ yep, those dumps are woefully incomplete :) useful for cross-checking, but i haven't got that far yet
23:06 dw_ initial scrape used urls from archive.org, then some random tpb dump i found, then the most recent 'full metadata' backup i could find (dec 2017)
23:06 dw_ now it is brute forcing backwards from current high id, very slowly
23:07 dw_ also want to combine this dump with the last backup and possibly tpb's own static dumps somehow, and add a 'deleted_by' field, showing when a torrent was removed
23:07 dw_ found tens of thousands removed since last backup already, i think most of it porn
23:08 dw_ tpb ids are spaced anywhere between 1 and 1500 apart. no clue what they're doing, it looks like an anti-scraping measure, i doubt they get that much spam
23:08 JAA I wouldn't be surprised at all of 99 % of the uploads were spam.
23:08 JAA You mentioned "very slowly": did you encounter any rate limits or similar?
23:09 dw_ yeah, we're talking about id jumps of 50 or 100 within seconds, it seems..
23:09 JAA if 99 % of*
23:09 dw_ i've been bumping concurrency up gently over time, currently its at 25, feels like it could take a lot higher but that might piss off cloudflare or tpb
23:10 JAA Any idea how many requests per second that translates into?
23:10 dw_ its wget :) so ssl session renegotiation on every connection, too late to fix
23:11 dw_ maybe getting ~20-25/sec
23:11 dw_ 1.3MiB/sec of mostly 404s
23:11 JAA So we're talking ~14 days of scanning.
23:11 dw_ thats almost exactly right
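[Editor's sketch] The "~14 days" figure can be reproduced with simple arithmetic. The calculation below is a sketch under two assumptions that the log does not state outright: that the ID space to scan runs up to roughly 29,943,379 (a torrent ID quoted elsewhere in this log as a recent example), and that every ID costs exactly one request at the quoted 20 to 25 requests per second. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def scan_days(id_count, requests_per_second):
    """Days needed to request every ID once at a steady request rate."""
    return id_count / requests_per_second / SECONDS_PER_DAY


# Assumption: the highest torrent ID is close to this example from the log.
HIGHEST_ID = 29_943_379

if __name__ == "__main__":
    for rate in (20, 25):
        print(f"{rate} req/s -> {scan_days(HIGHEST_ID, rate):.1f} days")
```

At 25 requests per second this lands just under 14 days, which is consistent with the estimate in the conversation; at 20 requests per second it is a few days longer.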
23:12 JAA Does TPB use Cloudflare? Doesn't look like it from the TLS certificate.
23:12 dw_ ip addresses are def cloudflare
23:13 JAA Yeah, true.
23:13 JAA They probably gave their COMODO cert to Cloudflare then.
23:14 JAA Or cert keys, to be precise.
23:14 dw_ im really surprised it's possible to crawl so fast to be honest
23:14 dw_ maybe they're on some special config, its probably one of cloudflare's biggest sites too
23:15 Despatche has quit IRC (Remote host closed the connection)
23:15 JAA One semi-annoying thing about retrieving https://thepiratebay.org/torrent/$id/ is that external links will be broken since they always have a name/slug after the ID.
23:15 JAA Works fine for what you're doing, but I'm thinking about WARCs at the moment.
23:15 dw_ i think its in link rel="canonical"
23:15 Despatche has joined #archiveteam-ot
23:16 JAA Nope
23:16 dw_ bah, nope :/
23:16 dw_ can probably figure out the slug algorithm easily enough :P
23:16 JAA Possibly. Spaces and + get replaced by underscores.
23:17 dw_ html entities escaped, then ampersands stripped too by the looks of it
23:17 dw_ really easy to cross-check this once the dump is complete..
23:17 dw_ pull out examples of every char in title, and longest titles, search for it, then compare
23:17 JAA Another option would be to go through the uploader's page.
23:18 dw_ its limited to 30ish pages, and only for recent results. their search by author cuts off at some point in history
23:18 JAA Ah right.
23:18 JAA Same as "recent torrents" then.
23:18 JAA Meh
23:19 JAA And categories and probably everything.
23:19 dw_ enumeration options are slim
23:20 dw_ archive.org's CDX server or whatever it's called had hardly any /torrent/ urls
23:20 dw_ doesn't seem to be any way to go from their public dump to a torrent without searching every single one of them, or some weird thing where you combine many approaches like category/author search, but it doesn't seem likely to be as complete as just brute force
23:21 dw_ the only thing i didnt check was common crawl, it should be possible to extract urls from it too, but again completeness
23:22 JAA Ah, comment pagination happens through JS also. Meh
23:23 dw_ i also want to brute force all the missing ids from the last backup
23:23 dw_ or at least cross check that backup against what their official dump claims
23:23 dw_ their id scheme is sufficiently weird that a scrape could easily miss stuff
23:24 ColdIce has joined #archiveteam-ot
23:25 JAA I can see a recursive crawl being incomplete, but a full ID bruteforce should be fine, no?
23:26 dw_ brute force so far is just from max(last backup)..max(top torrent on recent torrents page)
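[Editor's sketch] The brute-force range described above (everything above the highest ID in the last backup, up to the highest ID on the recent torrents page, walked backwards) could be enumerated roughly as follows. This is a sketch, not dw_'s actual xargs+wget job: the URL pattern is the one JAA quotes in the log, while the function name and the two ID parameters are hypothetical, and the generator only produces URLs without fetching them.

```python
# URL pattern as quoted in the log; {id} is filled in per torrent ID.
BASE_URL = "https://thepiratebay.org/torrent/{id}/"


def candidate_urls(last_backup_max_id, current_max_id):
    """Yield URLs for IDs above the last backup, newest first.

    last_backup_max_id is excluded (assumed already archived);
    current_max_id is included.
    """
    for torrent_id in range(current_max_id, last_backup_max_id, -1):
        yield BASE_URL.format(id=torrent_id)


if __name__ == "__main__":
    # Tiny made-up range, purely to show the ordering.
    for url in candidate_urls(5, 8):
        print(url)
```

Since the log notes that IDs are spaced 1 to 1500 apart, most of these URLs would be expected to return 404.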
23:27 JAA By the way, the ampersand of HTML entities isn't simply stripped in URLs but replaced with an underscore. Same for the semicolon. So "Väter" becomes "V_auml_ter".
23:27 JAA Doing this reliably will be very tricky.
23:27 dw_ yeah, looks like something like do_underscore_replace(htmlescapechars($title)) or something in php
23:28 dw_ pretty sure its truncated too
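[Editor's sketch] The slug rules guessed at above (escape to named HTML entities, then turn spaces, "+", "&" and ";" into underscores) can be tried out in a few lines of Python. This is only a guess reconstructed from the observations in the log, not the site's actual code; the function name is hypothetical, and the truncation both participants mention is deliberately not modelled because its length is not given.

```python
import re
from html.entities import codepoint2name


def guess_slug(title):
    """Approximate the observed title-to-slug mapping (a guess, not TPB's code)."""
    # Step 1: replace characters that have a named HTML entity, e.g. ä -> &auml;
    escaped = "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in title
    )
    # Step 2: spaces, plus signs, ampersands and semicolons become underscores.
    return re.sub(r"[ +&;]", "_", escaped)


if __name__ == "__main__":
    print(guess_slug("Väter"))
```

Characters without a named entity pass through unchanged here, which may well differ from what the site does; cross-checking against real URLs, as suggested in the log, would be needed before relying on it.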
23:28 JAA It might be possible to search for the title, but that probably fails in some situations as well.
23:28 dw_ 5 million search requests :( they'd almost certainly ban :P
23:29 JAA A ban has never kept us away.
23:29 JAA :-)
23:29 JAA And yes, titles are truncated, but not only in URLs.
23:29 dw_ hehe
23:30 JAA Random example from the top list: https://thepiratebay.org/torrent/29943379
23:32 dw_ i wondered about efficient approaches to scanning their id space
23:32 JAA And we arrived at "let's make millions of search queries". :-)
23:32 dw_ there have clearly been manual sequence resets at some point, e.g. there is torrent 9999999
23:33 dw_ hoping to draw a graph of torrent count by week and look for any obvious dips (assuming they're discernible from downtimes), that might help spot missing id ranges
23:33 JAA Yeah, I remember also something about them bumping the IDs at some point to make it seem like there were much more torrents or something. Might've been a different site though.
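[Editor's sketch] The "torrent count by week, look for dips" idea can be prototyped without a plotting library. The sketch below assumes each scraped torrent has an upload date available (the log does not say in what form) and flags ISO weeks whose count falls below a fraction of the median week. The function names and the 0.5 threshold are hypothetical choices, and the dates in the demo are made up.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median


def weekly_counts(upload_dates):
    """Count uploads per (ISO year, ISO week)."""
    return Counter(tuple(d.isocalendar()[:2]) for d in upload_dates)


def suspicious_weeks(counts, threshold=0.5):
    """Return weeks whose count is below threshold * median weekly count.

    Weeks with no uploads at all never appear in `counts`, so a fully
    missing week has to be detected separately.
    """
    typical = median(counts.values())
    return sorted(week for week, n in counts.items() if n < threshold * typical)


if __name__ == "__main__":
    # Made-up data: two normal weeks and one thin week.
    start = date(2019, 1, 1)
    dates = [start] * 10 + [start + timedelta(days=7)] * 10 + [start + timedelta(days=14)] * 2
    print(suspicious_weeks(weekly_counts(dates)))
```

As dw_ points out, a dip could equally be site downtime, so flagged weeks are only candidates for missing ID ranges.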