[00:03] *** ld1 has quit IRC (Quit: ~)
[00:27] *** VADemon has joined #archiveteam-bs
[01:06] *** schbirid2 has joined #archiveteam-bs
[01:11] *** schbirid has quit IRC (Read error: Operation timed out)
[01:27] *** decay has quit IRC (Quit: leaving)
[01:31] *** decay has joined #archiveteam-bs
[01:59] Ya JAA
[01:59] I figured it was a nice common base to support
[02:00] since it makes it an updatable target that is common
[02:08] *** pizzaiolo has quit IRC (Remote host closed the connection)
[02:21] *** i0npulse has quit IRC (Ping timeout: 255 seconds)
[02:35] *** i0npulse has joined #archiveteam-bs
[03:26] *** Petri152 has quit IRC (Ping timeout: 246 seconds)
[03:30] *** BlueMaxim has quit IRC (Quit: Leaving)
[03:41] i'm looking at setting up a patreon page
[03:49] *** arkhive has joined #archiveteam-bs
[03:54] Someone is archiving all of github
[03:54] Thinks it'll be 500gb
[03:54] I mean tb
[03:54] Offers it to IA
[03:54] I am going to dee-cline
[04:05] *** VADemon has quit IRC (Quit: left4dead)
[04:09] i know 500tb would be like $1 MILLION DOLLARS to host on IA
[04:10] it may be close to half of that price these days though
[04:18] *** odemg has quit IRC (Read error: Operation timed out)
[04:19] *** odemg has joined #archiveteam-bs
[04:27] *** ZexaronS- has joined #archiveteam-bs
[04:27] *** ZexaronS has quit IRC (Ping timeout: 260 seconds)
[04:40] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:44] *** Petri152 has joined #archiveteam-bs
[04:46] *** Sk1d has joined #archiveteam-bs
[04:57] Ya
[04:57] Unless it was all the issues + wiki content as well SketchCow, I wouldn't bother at all
[05:25] *** BlueMaxim has joined #archiveteam-bs
[05:33] *** Soni has quit IRC (Ping timeout: 250 seconds)
[05:36] *** Soni has joined #archiveteam-bs
[05:50] Someone mentioned a while ago here that they thought the torrent tracker Apollo was going to shut down, so I thought I'd mention that the HTTPS tracker has been down for well over 24 hours now.
[05:59] i'm uploading one of my Power Rangers WOC tapes
[05:59] to FOS
[05:59] i'm off to bed
[05:59] SketchCow: please don't upload my 'Godane VHS Capture' folder files in the meantime
[06:00] my wifi could disconnect in my sleep
[06:00] bbl
[08:46] *** drumstick has quit IRC (Ping timeout: 370 seconds)
[08:46] *** drumstick has joined #archiveteam-bs
[09:44] *** BartoCH has joined #archiveteam-bs
[09:53] *** BlueMaxim has quit IRC (Quit: Leaving)
[09:59] SketchCow: I'm glad someone is thinking about and practicing grabbing all of github -- but yeah, I don't think mirroring it on IA at this point is a good idea.
[10:00] Now, if someone had 500TB of space and wanted to use it to mirror half a petabyte of IA's stuff -- I think that would be welcome.
[10:01] And if the people archiving github were willing and able to filter their collection for material which had been *removed* from github, ...
[10:01] *that* material would seem eminently suitable for a (probably private) collection on IA.
[10:02] But that would also likely be only a couple TB, at most.
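A minimal sketch of the filtering idea above: checking whether a repository is still publicly visible via GitHub's API, where a 404 suggests it was deleted, renamed, or made private. The helper name and User-Agent string here are made up for illustration, and unauthenticated API calls are heavily rate-limited, so a real pass over a large repo list would need authentication.

    import urllib.error
    import urllib.request

    def repo_exists(owner: str, repo: str) -> bool:
        """Hypothetical helper: True if the GitHub repo is still publicly visible."""
        url = f"https://api.github.com/repos/{owner}/{repo}"
        # GitHub's API rejects requests that carry no User-Agent header.
        req = urllib.request.Request(url, headers={"User-Agent": "removed-repo-check"})
        try:
            with urllib.request.urlopen(req):
                return True
        except urllib.error.HTTPError as e:
            if e.code == 404:  # gone: deleted, renamed, or made private
                return False
            raise  # a 403 likely means rate-limiting; don't misreport it as removed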
[10:02] (rant over)
[10:04] https://www.youtube.com/watch?v=HQ_3g2hUCn4
[10:22] *** fie has quit IRC (Read error: Operation timed out)
[10:32] *** fie has joined #archiveteam-bs
[11:16] *** drumstick has quit IRC (Read error: Operation timed out)
[11:16] *** drumstick has joined #archiveteam-bs
[11:34] *** drumstick has quit IRC (Read error: Operation timed out)
[11:57] *** etudier has joined #archiveteam-bs
[12:20] *** Soni has quit IRC (Ping timeout: 190 seconds)
[12:26] *** Soni has joined #archiveteam-bs
[14:20] *** arkhive has quit IRC (Ping timeout: 255 seconds)
[14:42] *** dd0a13f37 has joined #archiveteam-bs
[14:43] Of the 500 TB, how much is images and similar? It feels like an extreme case of the Pareto principle; they can't have 500 TB of code
[14:44] If you strip away all files that are over 10 MB and entropy > 0.95, then deduplicate it, how much then? Can't be more than 1 TB
[14:45] I have a question for IA people - is it possible to get the Wayback Machine to always use a certain UA for certain sites?
[15:28] dd0a13f37: I'm not an IA person, but a feature like that would probably be possible; I don't think it exists right now, though. ArchiveBot can use other user agents.
[15:28] Only a predefined list
[15:28] ... but only from a selected list of four UAs, not a custom string.
[15:49] In most situations we don't need anything outside of those four.
[15:50] Googlebot?
[15:58] That could potentially be useful for some sites that use captchas; there's a potential ethical issue with using another bot's user agent though.
[15:59] That's the moral responsibility of the job submitter
[16:00] For captchas, some primitive ones can actually be cracked with today's software; nobody bothers cracking them properly since at small scales it's cheaper to hire someone to type them out
[16:00] Some sites hide content if you're not using a googlebot UA
[16:08] dd0a13f37: that's reason for delisting from Google btw
[16:08] dd0a13f37: https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1
[16:08] unsure how the exact submission process works
[16:09] but google forbids sites from serving different content to googlebot than to real agents
[16:09] (and they occasionally do tests with browser-like agents to verify this)
[16:09] A good example of sites that do it are paywalled sites
[16:09] Well, kinda
[16:10] At least you can often see their articles in Google Cache
[16:15] also not allowed :)
[16:15] referer checking *is* allowed though
[16:31] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[16:32] Somebody2: That someone is me -- I have looked at deduplicating and didn't make much progress on random subsets; I'll keep looking. And yes, I'd be happy to do a search for removed stuff after I get a mirror, sounds like a good idea
[16:33] Basically I've spent about a year trying to figure out how to do this efficiently/cheaper, and in that time 20%+ of my repo list has vanished
[16:33] (a year of spare time on weekends, ya'know)
[16:34] So I'm going to see if I can't just mirror now and reduce the size after
[16:34] To reduce loss in the meantime
[16:37] Thanks for the support everyone :)
[16:37] No worries if IA can't host, just worth a try
[16:37] *** dd0a13f37 has joined #archiveteam-bs
[16:39] Thanks joepie91_, do you know if they'll actually punish them or just tell them to stop?
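For reference, a sketch of the size/entropy filter dd0a13f37 describes at 14:44, assuming "entropy > 0.95" means Shannon entropy normalized to [0, 1] (above roughly 7.6 bits per byte, which is typical of already-compressed or encrypted blobs that won't deduplicate or compress further). The function names, the thresholds as defaults, and the 1 MiB sampling are illustrative choices, not anything stated in the log.

    import math
    import os

    def normalized_entropy(data: bytes) -> float:
        """Shannon entropy of the byte distribution, scaled to [0, 1]."""
        if not data:
            return 0.0
        counts = [0] * 256
        for b in data:
            counts[b] += 1
        n = len(data)
        bits = -sum(c / n * math.log2(c / n) for c in counts if c)
        return bits / 8.0  # 8 bits per byte is the maximum

    def worth_keeping(path: str, max_size=10 * 2**20, max_entropy=0.95) -> bool:
        """Drop files that are too big or look compressed/encrypted."""
        if os.path.getsize(path) > max_size:
            return False
        with open(path, "rb") as f:
            sample = f.read(2**20)  # sampling the first 1 MiB keeps this cheap
        return normalized_entropy(sample) <= max_entropy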
[16:40] ah fuck, you need a google account
[16:46] It's like how on some social media services you need an account to be able to report content
[16:47] In other words: Richard Stallman isn't able to report Facebook posts.
[16:50] *** refeed has joined #archiveteam-bs
[16:50] *** Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
[16:50] Anyone here have a google account and want to report a site violating google's rules then?
[16:52] *** Odd0002 has joined #archiveteam-bs
[16:52] *** Frogging has quit IRC (Read error: Operation timed out)
[16:53] *** Frogging has joined #archiveteam-bs
[16:53] dd0a13f37: they'll be delisted until they fix their shit, last time I checked
[16:54] as in, not appear in search results at all
[16:54] But they'll get some kind of warning
[16:54] the general idea is that google doesn't want misleading listings
[16:54] Yes, but they won't be punished?
[16:54] no, just a delist and a notification in the search console
[16:54] yes?
[16:54] by being delisted
[16:54] lol
[16:54] Yes, but will they get informed beforehand?
[16:54] Or won't they?
[16:55] Also, what would be the effect? Can they still give google special treatment somehow?
[16:55] Delisting is basically the death penalty. There is no harder punishment I can think of.
[16:55] ?
[16:55] Or will they be forced to show them the paywalled version, like everyone gets now?
[16:55] dd0a13f37: They are fine if they special-treat Google referers.
[16:55] Oh okay, so it won't change anything?
[16:55] dd0a13f37: I'm not sure what you're trying to get at - the result of misleading content is a delist, and that gets removed when the content is fixed
[16:55] that is all there is to it
[16:55] Yes, so it's not permanent
[16:56] No point in reporting them
[16:56] what
[16:56] OTOH, you could scrape with a fake referrer
[16:56] oh well, have to go
[16:56] the point is to coerce sites into not doing that, so of course there's a point in reporting them
[16:56] svd.se if anyone wants to report
[16:56] because that means they have to fix-or-sink
[16:57] *** Smiley has quit IRC (Read error: Operation timed out)
[17:01] *** pizzaiolo has joined #archiveteam-bs
[17:10] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[17:16] *** Smiley has joined #archiveteam-bs
[17:17] *** etudier has joined #archiveteam-bs
[17:21] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[17:44] *** dd0a13f37 has joined #archiveteam-bs
[17:44] Sorry for being unclear. My point was: if I report them to google, can they replace the googlebot agent detection with something else? For example, can they send google their articles so google can still index them?
[17:45] Or will they be forced to disable the paywall for anyone with a google referrer?
[17:46] And, when they start complying with google's demands, will they suffer any penalty from having been delisted? Will they get an advance warning to fix their shit since they're a large site, or will they be delisted and then forced to fix it ASAP?
[17:49] *** RichardG_ has joined #archiveteam-bs
[17:49] *** RichardG has quit IRC (Read error: Operation timed out)
[17:51] dd0a13f37: the rule that google sets is that the content a user sees when clicking a search result on google must be the same (or equivalent) as the content that the googlebot saw; i.e. so long as it functions with a google referer, it's fine
[17:52] dd0a13f37: afaik delisting is immediate and automatic
[17:52] site doesn't matter
[17:52] idem for re-listing
[18:00] Well, it's a real shame I don't have a google account then
[18:02] I would say it's easy to sign up, but nowadays they force you to give them a phone number
[18:02] Yes, and they don't allow Tor
[18:02] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[18:04] If anyone has an account and would like to report them, it would be very good - it's impossible to scrape them as it is right now since everything is behind a paywall
[18:07] An example URL of this selective behavior can be found at https://www.svd.se/fragan-om-manggifte-provar-tron-pa-det-egna-samhallet ; change your UA to googlebot and you'll get the whole page
[18:09] It even sets a cookie device-info=bot, which is usually device-info=desktop
[18:14] wew, it seems like they're more concerned with SE bots than with the users
[18:15] No, it's intentional
[18:15] dd0a13f37: so if I spoof Googlebot... I can get all the articles?
[18:15] It's a paywall, you need to pay $25 a month or something to read the articles
[18:15] Yes, exactly
[18:15] HANG ON A MOMENT
[18:15] Until they fix it
[18:22] okay
[18:22] I think https://www.google.com/webmasters/tools/spamreportform is the right form to submit it
[18:23] s/submit/report
[18:47] *** etudier has joined #archiveteam-bs
[18:53] *** VADemon has joined #archiveteam-bs
[19:14] *** schbirid2 has quit IRC (Ping timeout: 1208 seconds)
[19:16] *** refeed has quit IRC (Read error: Operation timed out)
[19:20] *** schbirid has joined #archiveteam-bs
[19:32] So, it IS a bootleg, but I just got an out-of-print anime as a gift from a co-worker, on a DVD set from 2000
[19:33] i'm doing 1:1 copies of it now
[19:40] *** schbirid has quit IRC (Remote host closed the connection)
[19:46] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[19:47] *** Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
[19:52] *** Xibalba has joined #archiveteam-bs
[19:53] *** etudier has joined #archiveteam-bs
[19:53] *** schbirid has joined #archiveteam-bs
[19:57] kisspunch: Good luck; glad for the suggestion of separating out removed stuff once a mirror is made.
[20:37] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[20:42] *** Mateon1 has quit IRC (Read error: Operation timed out)
[20:42] *** Mateon1 has joined #archiveteam-bs
[21:11] *** etudier has joined #archiveteam-bs
[21:21] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[21:25] *** BartoCH has quit IRC (Quit: WeeChat 1.9)
[21:27] *** etudier has joined #archiveteam-bs
[21:27] *** icedice has joined #archiveteam-bs
[21:27] *** icedice has left
[21:36] *** schbirid has quit IRC (Quit: Leaving)
[21:36] *** BartoCH has joined #archiveteam-bs
[21:41] *** BartoCH has quit IRC (Remote host closed the connection)
[22:17] *** BartoCH has joined #archiveteam-bs
[22:17] *** BartoCH has quit IRC (Remote host closed the connection)
[22:20] *** BartoCH has joined #archiveteam-bs
[22:23] *** drumstick has joined #archiveteam-bs
[22:38] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[22:43] *** pizzaiolo has joined #archiveteam-bs
[22:51] found a thing http://dh.mundus.xyz/Lynda/
[22:51] tons and tons of videos I think
[22:51] *** qwebirc18 has joined #archiveteam-bs
[22:52] mundus: I guess that's yours?
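The selective serving described at 18:07 can be checked with any HTTP client. Below is a sketch using the User-Agent string Google publishes for Googlebot; fetching with a Google Referer (the 16:56 suggestion) is the same idea with one extra header. Whether either request actually returns the full article is only what the log reports.

    import urllib.request

    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")

    def fetch(url, user_agent=GOOGLEBOT_UA, referer=None):
        """Fetch a page with a spoofed User-Agent and an optional Referer."""
        headers = {"User-Agent": user_agent}
        if referer:
            headers["Referer"] = referer
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    # Compare the two responses for the article mentioned above:
    url = "https://www.svd.se/fragan-om-manggifte-provar-tron-pa-det-egna-samhallet"
    paywalled = fetch(url, user_agent="Mozilla/5.0")  # normal browser-ish UA
    full = fetch(url)  # as Googlebot; reportedly returns the whole page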
[22:52] yes
[22:52] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[22:52] I had picked it up in my URL logger
[22:53] didn't see where it came from
[22:53] too many damn IRC channels I'm in
[22:53] #DataHoarder presumably
[22:53] Prolly
[22:53] I'm in 193 channels ATM
[22:53] *** Odd0002 has quit IRC (ZNC - http://znc.in)
[22:54] that's all lynda courses as of 2 days ago
[22:54] 2.8TB
[22:55] also just finished hacking each folder into a torrent
[22:55] *hashing
[22:55] but haven't added them to a client yet
[22:56] Are you sure nobody has ripped it before? Look through torrent indexes and see if you can avoid pointlessly splitting seeders.
[22:56] Anyone here ever heard of the ".vec" format? file(1) doesn't identify it
[22:56] https://front.e-pages.dk/data/Sun59c6e5cd1de35/dagen/620/vector/42.vec
[22:57] First 70 chars: 0#0#1024#1449!S4e4b4cBM034c577cL037a577cL037a5693L03515693L031c56afL03
[22:58] *** qwebirc18 is now known as dd0a13f37
[23:00] Fuck LinkedIn. When you access a page with a UA that isn't detected as a browser, they respond with HTTP status "999 Request denied". Because obviously everyone should be making up their own status codes instead of using 4xx.
[23:00] holy shit
[23:00] they really use 999
[23:00] WTF
[23:00] For example, try: curl -v https://www.linkedin.com/in/nmsanchez
[23:01] wtf
[23:01] geocities used that exact same code to say "fuck off you're ratelimited"
[23:01] haha
[23:01] BATTLE MODE 0999
[23:02] *** qwebirc78 has joined #archiveteam-bs
[23:02] god I'm a nerd
[23:02] i hate nerds
[23:02] LinkedIn blacklisting Tor is an order of magnitude worse
[23:02] Well, it's understandable
[23:02] Here's another weird one: http://og.infg.com.br/in/18932010-7b0-a07/FT1086A/420/MAPA-VIOLENCIA.png returns HTTP 750.
[23:03] No, it isn't. What good reason is there to block them from reading?
[23:03] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[23:03] I can understand blocking account creation, since it's not anonymous anyway
[23:03] *** qwebirc78 is now known as dd0a13f37
[23:03] But there's no point in blocking people from browsing
[23:05] *** BlueMaxim has joined #archiveteam-bs
[23:19] dd0a13f37: cuts down on the WAF log spam?
[23:20] I don't think they look through their logs manually
[23:20] they either just write them to /dev/null or automate it
[23:21] And they usually use proxies, not Tor
[23:27] gotta get my proxy list and a good proxy judge set up
[23:29] They're all blacklisted to hell and back
[23:31] well this is fucking retarded
[23:31] I register on a site, enter a sharklasers email, register fine
[23:32] go to the enter-password page, "username has illegal character"
[23:32] bravo
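The nonstandard status codes mentioned above (LinkedIn's 999, infoglobo's 750) are easy to observe with Python's http.client, which passes through whatever three-digit code the server sends rather than validating it. A small sketch; the User-Agent string is arbitrary, and the (999, 'Request denied') result is only what the log reports.

    import http.client

    def raw_status(host, path):
        """Return (status, reason) exactly as the server sent them."""
        conn = http.client.HTTPSConnection(host, timeout=10)
        # Deliberately not a browser UA, to trigger the bot response.
        conn.request("GET", path, headers={"User-Agent": "not-a-browser"})
        resp = conn.getresponse()
        try:
            return resp.status, resp.reason
        finally:
            conn.close()

    # Reportedly (999, 'Request denied'), matching the curl -v output above:
    print(raw_status("www.linkedin.com", "/in/nmsanchez"))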