#archiveteam-bs 2017-09-23,Sat


Time Nickname Message
00:03 🔗 ld1 has quit IRC (Quit: ~)
00:27 🔗 VADemon has joined #archiveteam-bs
01:06 🔗 schbirid2 has joined #archiveteam-bs
01:11 🔗 schbirid has quit IRC (Read error: Operation timed out)
01:27 🔗 decay has quit IRC (Quit: leaving)
01:31 🔗 decay has joined #archiveteam-bs
01:59 🔗 jrwr Ya JAA
01:59 🔗 jrwr I figured it was a nice common base to support
02:00 🔗 jrwr since it makes it a updatable target that is common
02:08 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
02:21 🔗 i0npulse has quit IRC (Ping timeout: 255 seconds)
02:35 🔗 i0npulse has joined #archiveteam-bs
03:26 🔗 Petri152 has quit IRC (Ping timeout: 246 seconds)
03:30 🔗 BlueMaxim has quit IRC (Quit: Leaving)
03:41 🔗 godane i'm looking at setting up a patreon page
03:49 🔗 arkhive has joined #archiveteam-bs
03:54 🔗 SketchCow Someone is archiving all of github
03:54 🔗 SketchCow Thinks it'll be 500gb
03:54 🔗 SketchCow I mean tb
03:54 🔗 SketchCow Offers it to IA
03:54 🔗 SketchCow I am going to dee-cline
04:05 🔗 VADemon has quit IRC (Quit: left4dead)
04:09 🔗 godane i know 500tb would be like $1 MILLION DOLLARS to host on IA
04:10 🔗 godane it may be close to half of that price these days though
04:18 🔗 odemg has quit IRC (Read error: Operation timed out)
04:19 🔗 odemg has joined #archiveteam-bs
04:27 🔗 ZexaronS- has joined #archiveteam-bs
04:27 🔗 ZexaronS has quit IRC (Ping timeout: 260 seconds)
04:40 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:44 🔗 Petri152 has joined #archiveteam-bs
04:46 🔗 Sk1d has joined #archiveteam-bs
04:57 🔗 jrwr Ya
04:57 🔗 jrwr Unless it was all the issues + wiki content as well SketchCow, I wouldnt bother at all
05:25 🔗 BlueMaxim has joined #archiveteam-bs
05:33 🔗 Soni has quit IRC (Ping timeout: 250 seconds)
05:36 🔗 Soni has joined #archiveteam-bs
05:50 🔗 hook54321 Someone mentioned a while ago here that they thought the torrent tracker Apollo was going to shut down, so I thought that I'd mention that the HTTPs tracker has been down for well over 24 hours now.
05:59 🔗 godane i'm uploading one of my Power Rangers WOC tapes
05:59 🔗 godane to FOS
05:59 🔗 godane i'm off to bed
05:59 🔗 godane SketchCow: please don't upload my 'Godane VHS Capture' folder files in the mean time
06:00 🔗 godane my wifi could disconnect in my sleep
06:00 🔗 godane bbl
08:46 🔗 drumstick has quit IRC (Ping timeout: 370 seconds)
08:46 🔗 drumstick has joined #archiveteam-bs
09:44 🔗 BartoCH has joined #archiveteam-bs
09:53 🔗 BlueMaxim has quit IRC (Quit: Leaving)
09:59 🔗 Somebody2 SketchCow: I'm glad someone is thinking and practicing grabbing all of github -- but yeah, I don't think mirroring it on IA at this point is a good idea.
10:00 🔗 Somebody2 Now, if someone had 500TB of space, and wanted to use it to mirror half a petabyte of IA's stuff -- I think that would be welcome.
10:01 🔗 Somebody2 And if the people archiving github were willing and able to filter their collection for material which had been *removed* from github, ...
10:01 🔗 Somebody2 *that* material would seem eminently suitable for a (probably private) collection on IA.
10:02 🔗 Somebody2 But that would also likely be only a couple TB, at most.
10:02 🔗 Somebody2 (rant over)
10:04 🔗 godane https://www.youtube.com/watch?v=HQ_3g2hUCn4
10:22 🔗 fie has quit IRC (Read error: Operation timed out)
10:32 🔗 fie has joined #archiveteam-bs
11:16 🔗 drumstick has quit IRC (Read error: Operation timed out)
11:16 🔗 drumstick has joined #archiveteam-bs
11:34 🔗 drumstick has quit IRC (Read error: Operation timed out)
11:57 🔗 etudier has joined #archiveteam-bs
12:20 🔗 Soni has quit IRC (Ping timeout: 190 seconds)
12:26 🔗 Soni has joined #archiveteam-bs
14:20 🔗 arkhive has quit IRC (Ping timeout: 255 seconds)
14:42 🔗 dd0a13f37 has joined #archiveteam-bs
14:43 🔗 dd0a13f37 Of the 500TB, how much is images and similar? It feels like an extreme case of the Pareto principle, they can't have 500TB of code
14:44 🔗 dd0a13f37 If you strip away all files that are over 10mb and entropy>0.95, then deduplicate it, how much then? Can't be more than 1tb
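The "strip files over 10 MB with entropy > 0.95" filter proposed above can be sketched in a few lines. This is an illustrative sketch only (the thresholds come from the message; the helper names and the 1 MiB sampling choice are assumptions, not anything actually used by the archivers):

```python
import math
import os
from collections import Counter

def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, normalized to [0, 1] (1.0 = 8 bits/byte)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return bits / 8.0  # 8 bits/byte is the maximum for byte data

def should_strip(path: str, size_limit: int = 10 * 1024 * 1024,
                 entropy_cutoff: float = 0.95) -> bool:
    """Flag files that are both large and high-entropy (likely compressed media,
    not source code) -- the filter described in the chat above."""
    if os.path.getsize(path) <= size_limit:
        return False
    with open(path, "rb") as f:
        sample = f.read(1 << 20)  # sample 1 MiB; usually representative
    return normalized_entropy(sample) > entropy_cutoff
```

Source code scores well below the cutoff (ASCII text rarely exceeds ~0.6 normalized entropy), while JPEGs, videos, and other compressed blobs sit near 1.0, so a filter like this would indeed separate "code" from "images and similar".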
14:45 🔗 dd0a13f37 I have a question for IA people- is it possible to get the wayback machine to always use a certain UA for certain sites?
15:28 🔗 hook54321 dd0a13f37: I'm not an IA person, but a feature like that would probably be possible, but I don't think it exists right now. ArchiveBot can use other useragents though.
15:28 🔗 dd0a13f37 Only a predefined list
15:28 🔗 JAA ... but only from a selected list of four UAs, not a custom string.
15:49 🔗 hook54321 In most situations we don't need something outside of those four.
15:50 🔗 dd0a13f37 Googlebot?
15:58 🔗 hook54321 That could potentially be useful for some sites that use captchas, there's a potential ethical issue with using another bot's useragent though.
15:59 🔗 dd0a13f37 That's the moral responsibility of the job submitter
16:00 🔗 dd0a13f37 For captchas, some primitive ones can actually be cracked with todays software, nobody bothers cracking them properly since at small scales it's cheaper to hire someone to type them out
16:00 🔗 dd0a13f37 Some sites hide content if you're not using googlebot UA
16:08 🔗 joepie91_ dd0a13f37: that's reason for delisting from Google btw
16:08 🔗 joepie91_ dd0a13f37: https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1
16:08 🔗 joepie91_ unsure how the exact submission process works
16:09 🔗 joepie91_ but google forbids sites from serving different content to googlebot than to real agents
16:09 🔗 joepie91_ (and they occasionally do tests with browser-like agents to verify this)
16:09 🔗 hook54321 A good example of sites that do it are paywalled sites
16:09 🔗 hook54321 Well, kinda
16:10 🔗 hook54321 At least you can see their articles in Google Cache often
16:15 🔗 joepie91_ also not allowed :)
16:15 🔗 joepie91_ referer checking *is* allowed though
16:31 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
16:32 🔗 kisspunch Somebody2: That someone is me--I have looked at deduplicating and didn't make too much progress on random subsets, I'll keep looking. And yes, I'd be happy to do a search for removed stuff after I get a mirror, sounds like a good idea
16:33 🔗 kisspunch Basically I've spent about a year trying to figure out how to do this efficiently/cheaper, and in that time 20%+ of my repo list has vanished
16:33 🔗 kisspunch (a year of spare time on weekends, ya'know)
16:34 🔗 kisspunch So I'm going to see if I can't just mirror now and reduce the size after
16:34 🔗 kisspunch To reduce loss in the meantime
16:37 🔗 kisspunch Thanks for the support everyone :)
16:37 🔗 kisspunch No worries if IA can't host, just worth a try
16:37 🔗 dd0a13f37 has joined #archiveteam-bs
16:39 🔗 dd0a13f37 Thanks joepie91_, do you know if they'll actually punish them or just tell them to stop?
16:40 🔗 dd0a13f37 ah fuck, you need a google account
16:46 🔗 hook54321 It's like how on some social media services you need an account to be able to report content
16:47 🔗 hook54321 In other words: Richard Stallman isn't able to report Facebook posts.
16:50 🔗 refeed has joined #archiveteam-bs
16:50 🔗 Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
16:50 🔗 dd0a13f37 Anyone here have a google account and want to report a site violating google's rules then?
16:52 🔗 Odd0002 has joined #archiveteam-bs
16:52 🔗 Frogging has quit IRC (Read error: Operation timed out)
16:53 🔗 Frogging has joined #archiveteam-bs
16:53 🔗 joepie91_ dd0a13f37: they'll be delisted until they fix their shit, last time I checked
16:54 🔗 joepie91_ as in, not appear in search results at all
16:54 🔗 dd0a13f37 But they'll get some kind of warning
16:54 🔗 joepie91_ the general idea is that google doesn't want misleading listings
16:54 🔗 dd0a13f37 Yes, but they won't be punished?
16:54 🔗 joepie91_ no, just a delist and a notification in the search console
16:54 🔗 joepie91_ yes?
16:54 🔗 joepie91_ by being delisted
16:54 🔗 joepie91_ lol
16:54 🔗 dd0a13f37 Yes but if they get informed beforehand
16:54 🔗 dd0a13f37 Or won't they?
16:55 🔗 dd0a13f37 Also, what would be the effect? Can they still give google special treatment somehow?
16:55 🔗 zino Delisting is basically the death penalty. There is no harder punishment I can think of.
16:55 🔗 joepie91_ ?
16:55 🔗 dd0a13f37 Or will they be forced to show them the content behind a paywall like everyone gets now?
16:55 🔗 zino dd0a13f37: They are fine if they special treat Google referers.
16:55 🔗 dd0a13f37 Oh okay, so it won't change anything?
16:55 🔗 joepie91_ dd0a13f37: I'm not sure what you're trying to get at - the result of misleading content is a delist, and that gets removed when the content is fixed
16:55 🔗 joepie91_ that is all there is to it
16:55 🔗 dd0a13f37 Yes, so it's not permanent
16:56 🔗 dd0a13f37 No point in reporting them
16:56 🔗 joepie91_ what
16:56 🔗 dd0a13f37 OTOH, you could scrape with fake referrer
16:56 🔗 dd0a13f37 oh well, have to go
16:56 🔗 joepie91_ the point is to coerce sites into not doing that, so of course there's a point in reporting them
16:56 🔗 dd0a13f37 svd.se if anyone wants to report
16:56 🔗 joepie91_ because that means they have to fix-or-sink
16:57 🔗 Smiley has quit IRC (Read error: Operation timed out)
17:01 🔗 pizzaiolo has joined #archiveteam-bs
17:10 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
17:16 🔗 Smiley has joined #archiveteam-bs
17:17 🔗 etudier has joined #archiveteam-bs
17:21 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
17:44 🔗 dd0a13f37 has joined #archiveteam-bs
17:44 🔗 dd0a13f37 Sorry for being unclear. My point was, if I report them to google, can they substitute googlebot agent detection for anything else? For example, can they send google their articles so they can index them still
17:45 🔗 dd0a13f37 Or will they be forced to disable paywall for anyone with google referrer?
17:46 🔗 dd0a13f37 And, when they start complying with google's demands, will they suffer any penalty from having been delisted? Will they get an advance warning to fix their shit since they're a large site, or will they be delisted and then forced to fix it ASAP?
17:49 🔗 RichardG_ has joined #archiveteam-bs
17:49 🔗 RichardG has quit IRC (Read error: Operation timed out)
17:51 🔗 joepie91_ dd0a13f37: the rule that google sets is that the content when a user clicks a search result on google, must be the same (or equivalent) as the content that the googlebot saw; ie. so long as it functions with a google referer, it's fine
17:52 🔗 joepie91_ dd0a13f37: afaik delisting is immediate and automatic
17:52 🔗 joepie91_ site doesn't matter
17:52 🔗 joepie91_ idem for re-listing
18:00 🔗 dd0a13f37 Well, it's a real shame I don't have a google account then
18:02 🔗 Frogging I would say it's easy to sign up but nowadays they force you to give them a phone number
18:02 🔗 dd0a13f37 Yes, and they don't allow Tor
18:02 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
18:04 🔗 dd0a13f37 If anyone has an account and would like to report them, it would be very good - it's impossible to scrape them as it is right now since everything is behind a paywall
18:07 🔗 dd0a13f37 An example URL of this selective behavior can be found at https://www.svd.se/fragan-om-manggifte-provar-tron-pa-det-egna-samhallet , change UA to googlebot and you'll get the whole page
18:09 🔗 dd0a13f37 It even sets a cookie device-info=bot which usually is device-info=desktop
18:14 🔗 refeed wew, it seems like they're more concerned with SE-bots rather than the users
18:15 🔗 dd0a13f37 No, it's intentional
18:15 🔗 HCross2 dd0a13f37: so if I spoof Googlebot.. I can get all the articles?
18:15 🔗 dd0a13f37 It's a paywall, you need to pay $25 a month or something to read the articles
18:15 🔗 dd0a13f37 Yes, exactly
18:15 🔗 HCross2 HANG ON A MOMENT
18:15 🔗 dd0a13f37 Until they fix it
18:22 🔗 refeed okay
18:22 🔗 refeed I think https://www.google.com/webmasters/tools/spamreportform is the right form to submit it
18:23 🔗 refeed s/submit/report
18:47 🔗 etudier has joined #archiveteam-bs
18:53 🔗 VADemon has joined #archiveteam-bs
19:14 🔗 schbirid2 has quit IRC (Ping timeout: 1208 seconds)
19:16 🔗 refeed has quit IRC (Read error: Operation timed out)
19:20 🔗 schbirid has joined #archiveteam-bs
19:32 🔗 jrwr So, it IS a Bootleg, but I just got a out of print Anime as a gift from a co-worker on a DVD Set from 2000
19:33 🔗 jrwr im doing 1:1 copies of it now
19:40 🔗 schbirid has quit IRC (Remote host closed the connection)
19:46 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
19:47 🔗 Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
19:52 🔗 Xibalba has joined #archiveteam-bs
19:53 🔗 etudier has joined #archiveteam-bs
19:53 🔗 schbirid has joined #archiveteam-bs
19:57 🔗 Somebody2 kisspunch: Good luck; glad for the suggestion of separating out removed stuff once a mirror is made.
20:37 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
20:42 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
20:42 🔗 Mateon1 has joined #archiveteam-bs
21:11 🔗 etudier has joined #archiveteam-bs
21:21 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
21:25 🔗 BartoCH has quit IRC (Quit: WeeChat 1.9)
21:27 🔗 etudier has joined #archiveteam-bs
21:27 🔗 icedice has joined #archiveteam-bs
21:27 🔗 icedice has left
21:36 🔗 schbirid has quit IRC (Quit: Leaving)
21:36 🔗 BartoCH has joined #archiveteam-bs
21:41 🔗 BartoCH has quit IRC (Remote host closed the connection)
22:17 🔗 BartoCH has joined #archiveteam-bs
22:17 🔗 BartoCH has quit IRC (Remote host closed the connection)
22:20 🔗 BartoCH has joined #archiveteam-bs
22:23 🔗 drumstick has joined #archiveteam-bs
22:38 🔗 pizzaiolo has quit IRC (Quit: pizzaiolo)
22:43 🔗 pizzaiolo has joined #archiveteam-bs
22:51 🔗 jrwr found a thing http://dh.mundus.xyz/Lynda/
22:51 🔗 jrwr tons and tons of videos I think
22:51 🔗 qwebirc18 has joined #archiveteam-bs
22:52 🔗 JAA mundus: I guess that's yours?
22:52 🔗 mundus yes
22:52 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
22:52 🔗 jrwr I had picked it up in my URL Logger
22:53 🔗 jrwr didn't see where it came from
22:53 🔗 jrwr too many damn IRC channels I'm in
22:53 🔗 mundus #DataHoarder presumably
22:53 🔗 jrwr Prolly
22:53 🔗 jrwr Im in 193 Channel ATM
22:53 🔗 Odd0002 has quit IRC (ZNC - http://znc.in)
22:54 🔗 mundus that's all lynda courses as of 2 days ago
22:54 🔗 mundus 2.8TB
22:55 🔗 mundus also just finished hacking each folder into a torrent
22:55 🔗 mundus *hashing
22:55 🔗 mundus but haven't added to a client yet
22:56 🔗 qwebirc18 Are you sure nobody has ripped it before? Look through torrent indexes and see if you can avoid pointless splitting of seeders.
22:56 🔗 qwebirc18 Anyone here ever heard about the ".vec" format? file doesn't identify it
22:56 🔗 qwebirc18 https://front.e-pages.dk/data/Sun59c6e5cd1de35/dagen/620/vector/42.vec
22:57 🔗 qwebirc18 First 70 chars: 0#0#1024#1449!S4e4b4cBM034c577cL037a577cL037a5693L03515693L031c56afL03
22:58 🔗 qwebirc18 is now known as dd0a13f37
23:00 🔗 JAA Fuck LinkedIn. When you access a page with a UA that isn't detected as a browser, they respond with HTTP status "999 Request denied". Because obviously everyone should be making up their own status codes instead of using 4xx.
23:00 🔗 jrwr holy shit
23:00 🔗 jrwr they really use 999
23:00 🔗 jrwr WTF
23:00 🔗 JAA For example, try: curl -v https://www.linkedin.com/in/nmsanchez
23:01 🔗 mundus wtf
23:01 🔗 astrid geocities used that exact same code to say "fuck off you're ratelimited"
23:01 🔗 jrwr haha
23:01 🔗 jrwr BATTLE MODE 0999
23:02 🔗 qwebirc78 has joined #archiveteam-bs
23:02 🔗 jrwr god I'm a nerd
23:02 🔗 astrid i hate nerds
23:02 🔗 qwebirc78 Linkedin blacklisting tor is an order of magnitude worse
23:02 🔗 jrwr Well its understandable
23:02 🔗 JAA Here's another weird one: http://og.infg.com.br/in/18932010-7b0-a07/FT1086A/420/MAPA-VIOLENCIA.png returns HTTP 750.
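Out-of-spec codes like the 999 and 750 above are easy to detect programmatically, since HTTP clients generally pass through whatever integer the server sends. A standard-library sketch (the function names are illustrative; the live sites may no longer return these codes):

```python
import http.client
from urllib.parse import urlsplit

def probe_status(url: str, user_agent: str = "curl/7.55.1") -> int:
    """Return the raw HTTP status code for a GET request, whatever it is --
    http.client does not reject codes outside the RFC-defined ranges."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=30)
    try:
        conn.request("GET", parts.path or "/", headers={"User-Agent": user_agent})
        return conn.getresponse().status
    finally:
        conn.close()

def is_nonstandard(status: int) -> bool:
    """True for codes outside the five classes (1xx-5xx) the HTTP spec defines."""
    return not 100 <= status <= 599

# e.g. is_nonstandard(probe_status("https://www.linkedin.com/in/nmsanchez"))
# would be True if LinkedIn still answers non-browser UAs with 999.
```

Both 999 and 750 fall outside every RFC-defined class, which is what makes them so unfriendly to crawlers that switch on the status-code class.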
23:03 🔗 qwebirc78 No, it isn't. What good reason is there to block them from reading?
23:03 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
23:03 🔗 qwebirc78 I can understand creating accounts since it's not anonymous anyway
23:03 🔗 qwebirc78 is now known as dd0a13f37
23:03 🔗 dd0a13f37 But there's no point in blocking people from browsing
23:05 🔗 BlueMaxim has joined #archiveteam-bs
23:19 🔗 jrwr dd0a13f37: cuts down on the WAF log spam?
23:20 🔗 dd0a13f37 I don't think they look through their logs manually
23:20 🔗 dd0a13f37 they either just write them to /dev/null or automate it
23:21 🔗 dd0a13f37 And they usually use proxies, not tor
23:27 🔗 jrwr gotta get my proxy list and a good proxy judge setup
23:29 🔗 dd0a13f37 They're all blacklisted to hell and back
23:31 🔗 dd0a13f37 well this is fucking retarded
23:31 🔗 dd0a13f37 I register on a site, enter sharklasers email, register fine
23:32 🔗 dd0a13f37 go to enter password page, "username has illegal character"
23:32 🔗 dd0a13f37 bravo
