#archiveteam-bs 2017-09-23,Sat


***ld1 has quit IRC (Quit: ~) [00:03]
..... (idle for 24mn)
VADemon has joined #archiveteam-bs [00:27]
........ (idle for 39mn)
schbirid2 has joined #archiveteam-bs [01:06]
schbirid has quit IRC (Read error: Operation timed out) [01:11]
.... (idle for 16mn)
decay has quit IRC (Quit: leaving)
decay has joined #archiveteam-bs
[01:27]
...... (idle for 28mn)
jrwr: Ya JAA
I figured it was a nice common base to support
since it makes it an updatable target that is common
[01:59]
***pizzaiolo has quit IRC (Remote host closed the connection) [02:08]
i0npulse has quit IRC (Ping timeout: 255 seconds) [02:21]
i0npulse has joined #archiveteam-bs [02:35]
........... (idle for 51mn)
Petri152 has quit IRC (Ping timeout: 246 seconds)
BlueMaxim has quit IRC (Quit: Leaving)
[03:26]
godane: i'm looking at setting up a patreon page [03:41]
***arkhive has joined #archiveteam-bs [03:49]
SketchCow: Someone is archiving all of github
Thinks it'll be 500gb
I mean tb
Offers it to IA
I am going to dee-cline
[03:54]
***VADemon has quit IRC (Quit: left4dead) [04:05]
godane: i know 500tb would be like $1 MILLION DOLLARS to host on IA
it may be close to half of that price these days though
[04:09]
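For scale, godane's numbers imply a storage cost of about $2/GB, a figure the Internet Archive has historically cited for keeping data in perpetuity:

    500 TB × 1,000 GB/TB × $2/GB = $1,000,000

"close to half of that price these days" would put it nearer $1/GB, i.e. roughly $500,000.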
***odemg has quit IRC (Read error: Operation timed out)
odemg has joined #archiveteam-bs
[04:18]
ZexaronS- has joined #archiveteam-bs
ZexaronS has quit IRC (Ping timeout: 260 seconds)
[04:27]
Sk1d has quit IRC (Ping timeout: 194 seconds)
Petri152 has joined #archiveteam-bs
Sk1d has joined #archiveteam-bs
[04:40]
jrwr: Ya
Unless it was all the issues + wiki content as well SketchCow, I wouldn't bother at all
[04:57]
...... (idle for 28mn)
***BlueMaxim has joined #archiveteam-bs [05:25]
Soni has quit IRC (Ping timeout: 250 seconds)
Soni has joined #archiveteam-bs
[05:33]
hook54321: Someone mentioned a while ago here that they thought the torrent tracker Apollo was going to shut down, so I thought that I'd mention that the HTTPS tracker has been down for well over 24 hours now. [05:50]
godane: i'm uploading one of my Power Rangers WOC tapes
to FOS
i'm off to bed
SketchCow: please don't upload my 'Godane VHS Capture' folder files in the meantime
my wifi could disconnect in my sleep
bbl
[05:59]
.................................. (idle for 2h46mn)
***drumstick has quit IRC (Ping timeout: 370 seconds)
drumstick has joined #archiveteam-bs
[08:46]
............ (idle for 58mn)
BartoCH has joined #archiveteam-bs [09:44]
BlueMaxim has quit IRC (Quit: Leaving) [09:53]
Somebody2: SketchCow: I'm glad someone is thinking and practicing grabbing all of github -- but yeah, I don't think mirroring it on IA at this point is a good idea.
Now, if someone had 500TB of space, and wanted to use it to mirror half a petabyte of IA's stuff -- I think that would be welcome.
And if the people archiving github were willing and able to filter their collection for material which had been *removed* from github, ...
*that* material would seem eminently suitable for a (probably private) collection on IA.
But that would also likely be only a couple TB, at most.
(rant over)
[09:59]
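Somebody2's filtering idea reduces to a simple existence probe per repository. A minimal sketch, assuming the mirror keeps an owner/name list; the endpoint is GitHub's standard repos API, while the function name and UA string are made up for illustration:

    import urllib.error
    import urllib.request

    def repo_gone(owner, name):
        """True if GitHub now returns 404 for the repo (removed or made private)."""
        url = "https://api.github.com/repos/%s/%s" % (owner, name)
        # GitHub's API rejects requests that send no User-Agent header
        req = urllib.request.Request(url, headers={"User-Agent": "removed-repo-filter"})
        try:
            urllib.request.urlopen(req)
            return False
        except urllib.error.HTTPError as e:
            return e.code == 404

    # e.g. repo_gone("someowner", "somerepo"); note that unauthenticated
    # requests are rate-limited to 60 per hour, so a full run needs a token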
godane: https://www.youtube.com/watch?v=HQ_3g2hUCn4 [10:04]
.... (idle for 18mn)
***fie has quit IRC (Read error: Operation timed out) [10:22]
fie has joined #archiveteam-bs [10:32]
......... (idle for 44mn)
drumstick has quit IRC (Read error: Operation timed out)
drumstick has joined #archiveteam-bs
[11:16]
.... (idle for 18mn)
drumstick has quit IRC (Read error: Operation timed out) [11:34]
..... (idle for 23mn)
etudier has joined #archiveteam-bs [11:57]
..... (idle for 23mn)
Soni has quit IRC (Ping timeout: 190 seconds) [12:20]
Soni has joined #archiveteam-bs [12:26]
....................... (idle for 1h54mn)
arkhive has quit IRC (Ping timeout: 255 seconds) [14:20]
..... (idle for 22mn)
dd0a13f37 has joined #archiveteam-bs [14:42]
dd0a13f37: Of the 500tb, how much is images and similar? It feels like an extreme case of the Pareto principle; they can't have 500tb of code
If you strip away all files that are over 10mb and entropy>0.95, then deduplicate it, how much then? Can't be more than 1tb
I have a question for IA people: is it possible to get the wayback machine to always use a certain UA for certain sites?
[14:43]
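dd0a13f37's estimate is easy to prototype. A minimal sketch, with the 10mb size and 0.95 entropy thresholds taken from the message above; "entropy" here is Shannon entropy of the byte histogram, normalized to 0..1 so that 1 means effectively incompressible, and the directory-walk layout is an assumption:

    import math
    import os
    import sys
    from collections import Counter

    def normalized_entropy(path):
        """Shannon entropy of the file's bytes, scaled to 0..1."""
        counts = Counter()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                counts.update(chunk)
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values()) / 8

    # list the big, incompressible files that would be stripped before deduplicating
    for root, _, files in os.walk(sys.argv[1]):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > 10 * 1024 * 1024 and normalized_entropy(path) > 0.95:
                print(path)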
......... (idle for 43mn)
hook54321: dd0a13f37: I'm not an IA person, but while a feature like that would probably be possible, I don't think it exists right now. ArchiveBot can use other useragents though. [15:28]
dd0a13f37: Only a predefined list [15:28]
JAA: ... but only from a selected list of four UAs, not a custom string. [15:28]
..... (idle for 21mn)
hook54321: In most situations we don't need something outside of those four. [15:49]
dd0a13f37: Googlebot? [15:50]
hook54321: That could potentially be useful for some sites that use captchas, but there's a potential ethical issue with using another bot's useragent. [15:58]
dd0a13f37: That's the moral responsibility of the job submitter
For captchas, some primitive ones can actually be cracked with today's software; nobody bothers cracking them properly since at small scales it's cheaper to hire someone to type them out
Some sites hide content if you're not using a googlebot UA
[15:59]
joepie91_: dd0a13f37: that's grounds for delisting from Google btw
dd0a13f37: https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1
unsure how the exact submission process works
but google forbids sites from serving different content to googlebot than to real agents
(and they occasionally do tests with browser-like agents to verify this)
[16:08]
hook54321: A good example of sites that do this is paywalled sites
Well, kinda
At least you can see their articles in Google Cache often
[16:09]
joepie91_: also not allowed :)
referer checking *is* allowed though
[16:15]
.... (idle for 16mn)
***dd0a13f37 has quit IRC (Ping timeout: 268 seconds) [16:31]
kisspunch: Somebody2: That someone is me. I have looked at deduplicating and didn't make too much progress on random subsets; I'll keep looking. And yes, I'd be happy to do a search for removed stuff after I get a mirror, sounds like a good idea
Basically I've spent about a year trying to figure out how to do this efficiently/cheaper, and in that time 20%+ of my repo list has vanished
(a year of spare time on weekends, y'know)
So I'm going to see if I can't just mirror now and reduce the size after
To reduce loss in the meantime
Thanks for the support everyone :)
No worries if IA can't host, just worth a try
[16:32]
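One naive way to measure the duplication kisspunch is chasing, purely as an illustration (the tree layout and the choice of SHA-256 are assumptions): hash every file in the mirror and count how many paths share a digest.

    import hashlib
    import os
    import sys
    from collections import defaultdict

    paths_by_digest = defaultdict(list)
    for root, _, files in os.walk(sys.argv[1]):
        for name in files:
            path = os.path.join(root, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            paths_by_digest[h.hexdigest()].append(path)

    # every copy beyond the first is space that deduplication would reclaim
    dupes = sum(len(v) - 1 for v in paths_by_digest.values() if len(v) > 1)
    print("%d redundant file copies" % dupes)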
***dd0a13f37 has joined #archiveteam-bs [16:37]
dd0a13f37: Thanks joepie91_. Do you know if they'll actually punish them or just tell them to stop?
ah fuck, you need a google account
[16:39]
hook54321: It's like how on some social media services you need an account to be able to report content
In other words: Richard Stallman isn't able to report Facebook posts.
[16:46]
***refeed has joined #archiveteam-bs
Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
[16:50]
dd0a13f37: Anyone here have a google account and want to report a site violating google's rules then? [16:50]
***Odd0002 has joined #archiveteam-bs
Frogging has quit IRC (Read error: Operation timed out)
Frogging has joined #archiveteam-bs
[16:52]
joepie91_: dd0a13f37: they'll be delisted until they fix their shit, last time I checked
as in, not appear in search results at all
[16:53]
dd0a13f37: But they'll get some kind of warning? [16:54]
joepie91_: the general idea is that google doesn't want misleading listings [16:54]
dd0a13f37: Yes, but they won't be punished? [16:54]
joepie91_: no, just a delist and a notification in the search console
yes?
by being delisted
lol
[16:54]
dd0a13f37: Yes, but will they get informed beforehand?
Or won't they?
Also, what would be the effect? Can they still give google special treatment somehow?
[16:54]
zino: Delisting is basically the death penalty. There is no harder punishment I can think of. [16:55]
joepie91_: ? [16:55]
dd0a13f37: Or will they be forced to show them the content behind a paywall like everyone gets now? [16:55]
zino: dd0a13f37: They are fine if they give Google referers special treatment. [16:55]
dd0a13f37: Oh okay, so it won't change anything? [16:55]
joepie91_: dd0a13f37: I'm not sure what you're trying to get at - the result of misleading content is a delist, and that gets removed when the content is fixed
that is all there is to it
[16:55]
dd0a13f37: Yes, so it's not permanent
No point in reporting them
[16:55]
joepie91_: what [16:56]
dd0a13f37: OTOH, you could scrape with a fake referrer
oh well, have to go
[16:56]
joepie91_: the point is to coerce sites into not doing that, so of course there's a point in reporting them [16:56]
dd0a13f37: svd.se if anyone wants to report [16:56]
joepie91_: because that means they have to fix-or-sink [16:56]
***Smiley has quit IRC (Read error: Operation timed out)
pizzaiolo has joined #archiveteam-bs
[16:57]
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) [17:10]
Smiley has joined #archiveteam-bs
etudier has joined #archiveteam-bs
dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[17:16]
..... (idle for 23mn)
dd0a13f37 has joined #archiveteam-bs [17:44]
dd0a13f37: Sorry for being unclear. My point was, if I report them to google, can they substitute googlebot agent detection for anything else? For example, can they send google their articles so they can still index them?
Or will they be forced to disable the paywall for anyone with a google referrer?
And, when they start complying with google's demands, will they suffer any penalty from having been delisted? Will they get an advance warning to fix their shit since they're a large site, or will they be delisted and then forced to fix it ASAP?
[17:44]
***RichardG_ has joined #archiveteam-bs
RichardG has quit IRC (Read error: Operation timed out)
[17:49]
joepie91_: dd0a13f37: the rule that google sets is that the content a user sees when clicking a search result on google must be the same (or equivalent) as the content that the googlebot saw; i.e. as long as it works with a google referer, it's fine
dd0a13f37: afaik delisting is immediate and automatic
site doesn't matter
idem for re-listing
[17:51]
dd0a13f37: Well, it's a real shame I don't have a google account then [18:00]
Frogging: I would say it's easy to sign up, but nowadays they force you to give them a phone number [18:02]
dd0a13f37: Yes, and they don't allow Tor [18:02]
***etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) [18:02]
dd0a13f37: If anyone has an account and would like to report them, it would be very good; it's impossible to scrape them as it is right now since everything is behind a paywall
An example URL of this selective behavior can be found at https://www.svd.se/fragan-om-manggifte-provar-tron-pa-det-egna-samhallet ; change your UA to googlebot and you'll get the whole page
It even sets a cookie device-info=bot where it's usually device-info=desktop
[18:04]
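Assuming the site still behaves as dd0a13f37 describes, the cloaking is easy to confirm: fetch the same article with a browser-like UA, the published Googlebot UA, and a Google referer, and compare the response sizes. A sketch:

    import urllib.request

    URL = "https://www.svd.se/fragan-om-manggifte-provar-tron-pa-det-egna-samhallet"
    variants = {
        "browser": {"User-Agent": "Mozilla/5.0"},
        "googlebot ua": {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
        "google referer": {"User-Agent": "Mozilla/5.0", "Referer": "https://www.google.com/"},
    }
    for label, headers in variants.items():
        req = urllib.request.Request(URL, headers=headers)
        body = urllib.request.urlopen(req).read()
        # a much larger body for the bot UA would indicate UA-based cloaking
        print(label, len(body))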
refeed: wew, it seems like they're more concerned with SE bots than with the users [18:14]
dd0a13f37: No, it's intentional [18:15]
HCross2: dd0a13f37: so if I spoof Googlebot... I can get all the articles? [18:15]
dd0a13f37: It's a paywall, you need to pay $25 a month or something to read the articles
Yes, exactly
[18:15]
HCross2: HANG ON A MOMENT [18:15]
dd0a13f37: Until they fix it [18:15]
refeed: okay
I think https://www.google.com/webmasters/tools/spamreportform is the right form to submit it
s/submit/report
[18:22]
..... (idle for 24mn)
***etudier has joined #archiveteam-bs [18:47]
VADemon has joined #archiveteam-bs [18:53]
..... (idle for 21mn)
schbirid2 has quit IRC (Ping timeout: 1208 seconds)
refeed has quit IRC (Read error: Operation timed out)
schbirid has joined #archiveteam-bs
[19:14]
jrwr: So, it IS a bootleg, but I just got an out-of-print anime as a gift from a co-worker, on a DVD set from 2000
i'm doing 1:1 copies of it now
[19:32]
***schbirid has quit IRC (Remote host closed the connection) [19:40]
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
[19:46]
Xibalba has joined #archiveteam-bs
etudier has joined #archiveteam-bs
schbirid has joined #archiveteam-bs
[19:52]
Somebody2: kisspunch: Good luck; glad for the suggestion of separating out removed stuff once a mirror is made. [19:57]
......... (idle for 40mn)
***etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) [20:37]
Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
[20:42]
...... (idle for 29mn)
etudier has joined #archiveteam-bs [21:11]
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
BartoCH has quit IRC (Quit: WeeChat 1.9)
etudier has joined #archiveteam-bs
icedice has joined #archiveteam-bs
icedice has left
[21:21]
schbirid has quit IRC (Quit: Leaving)
BartoCH has joined #archiveteam-bs
[21:36]
BartoCH has quit IRC (Remote host closed the connection) [21:41]
........ (idle for 36mn)
BartoCH has joined #archiveteam-bs
BartoCH has quit IRC (Remote host closed the connection)
BartoCH has joined #archiveteam-bs
drumstick has joined #archiveteam-bs
[22:17]
.... (idle for 15mn)
pizzaiolo has quit IRC (Quit: pizzaiolo) [22:38]
pizzaiolo has joined #archiveteam-bs [22:43]
jrwr: found a thing http://dh.mundus.xyz/Lynda/
tons and tons of videos I think
[22:51]
***qwebirc18 has joined #archiveteam-bs [22:51]
JAA: mundus: I guess that's yours? [22:52]
mundus: yes [22:52]
***dd0a13f37 has quit IRC (Ping timeout: 268 seconds) [22:52]
jrwr: I had picked it up in my URL logger
didn't see where it came from
too many damn IRC channels I'm in
[22:52]
mundus: #DataHoarder presumably [22:53]
jrwr: Prolly
I'm in 193 channels ATM
[22:53]
***Odd0002 has quit IRC (ZNC - http://znc.in) [22:53]
mundus: that's all Lynda courses as of 2 days ago
2.8TB
also just finished hacking each folder into a torrent
*hashing
but haven't added to a client yet
[22:54]
qwebirc18: Are you sure nobody has ripped it before? Look through torrent indexes and see if you can avoid pointless splitting of seeders.
Anyone here ever heard of the ".vec" format? file(1) doesn't identify it
https://front.e-pages.dk/data/Sun59c6e5cd1de35/dagen/620/vector/42.vec
First 70 chars: 0#0#1024#1449!S4e4b4cBM034c577cL037a577cL037a5693L03515693L031c56afL03
[22:56]
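A guess at the structure, judging only from the sample above: the part before the '!' looks like four '#'-separated header fields (possibly an origin plus a 1024x1449 canvas), 'S4e4b4c' looks like it could set a hex RGB colour, and the rest looks like single-letter opcodes with fixed-width hex arguments, with M/L plausibly moveto/lineto. A speculative tokenizer along those lines; every interpretation here is an assumption:

    import re

    sample = "0#0#1024#1449!S4e4b4cBM034c577cL037a577cL037a5693L03515693L031c56afL03"
    header, _, body = sample.partition("!")
    print("header fields:", header.split("#"))  # maybe x0, y0, width, height
    # guess: an opcode letter, then a run of hex digits; split the run into
    # 16-bit words, which would fit coordinate pairs for M/L commands
    for op, args in re.findall(r"([A-Za-z])([0-9a-f]*)", body):
        words = [int(args[i:i + 4], 16) for i in range(0, len(args) - len(args) % 4, 4)]
        print(op, args, words)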
***qwebirc18 is now known as dd0a13f37 [22:58]
JAA: Fuck LinkedIn. When you access a page with a UA that isn't detected as a browser, they respond with HTTP status "999 Request denied". Because obviously everyone should be making up their own status codes instead of using 4xx. [23:00]
jrwr: holy shit
they really use 999
WTF
[23:00]
JAA: For example, try: curl -v https://www.linkedin.com/in/nmsanchez [23:00]
mundus: wtf [23:01]
astrid: geocities used that exact same code to say "fuck off, you're ratelimited" [23:01]
jrwr: haha
BATTLE MODE 0999
[23:01]
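The 999 is easy to observe without curl, too. A sketch assuming the behavior JAA describes; Python's urllib raises HTTPError for any non-2xx status, invented codes included, and the UA string here is just an arbitrary non-browser value:

    import urllib.error
    import urllib.request

    req = urllib.request.Request(
        "https://www.linkedin.com/in/nmsanchez",
        headers={"User-Agent": "curl/7.55.1"},  # anything not detected as a browser
    )
    try:
        urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print(e.code, e.reason)  # expect the non-standard "999 Request denied"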
***qwebirc78 has joined #archiveteam-bs [23:02]
jrwr: god I'm a nerd [23:02]
astrid: i hate nerds [23:02]
qwebirc78: LinkedIn blacklisting Tor is an order of magnitude worse [23:02]
jrwr: Well, it's understandable [23:02]
JAA: Here's another weird one: http://og.infg.com.br/in/18932010-7b0-a07/FT1086A/420/MAPA-VIOLENCIA.png returns HTTP 750. [23:02]
qwebirc78: No, it isn't. What good reason is there to block them from reading? [23:03]
***dd0a13f37 has quit IRC (Ping timeout: 268 seconds) [23:03]
qwebirc78: I can understand blocking account creation, since it's not anonymous anyway [23:03]
***qwebirc78 is now known as dd0a13f37 [23:03]
dd0a13f37: But there's no point in blocking people from browsing [23:03]
***BlueMaxim has joined #archiveteam-bs [23:05]
jrwr: dd0a13f37: cuts down on the WAF log spam? [23:19]
dd0a13f37: I don't think they look through their logs manually
they either just write them to /dev/null or automate it
And they usually use proxies, not tor
[23:20]
jrwr: gotta get my proxy list and a good proxy judge setup [23:27]
dd0a13f37: They're all blacklisted to hell and back
well this is fucking retarded
I register on a site, enter a sharklasers email, register fine
go to enter password page, "username has illegal character"
bravo
[23:29]
