Time |
Nickname |
Message |
00:03
🔗
|
|
ld1 has quit IRC (Quit: ~) |
00:27
🔗
|
|
VADemon has joined #archiveteam-bs |
01:06
🔗
|
|
schbirid2 has joined #archiveteam-bs |
01:11
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
01:27
🔗
|
|
decay has quit IRC (Quit: leaving) |
01:31
🔗
|
|
decay has joined #archiveteam-bs |
01:59
🔗
|
jrwr |
Ya JAA |
01:59
🔗
|
jrwr |
I figured it was a nice common base to support |
02:00
🔗
|
jrwr |
since it makes it an updatable target that is common |
02:08
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
02:21
🔗
|
|
i0npulse has quit IRC (Ping timeout: 255 seconds) |
02:35
🔗
|
|
i0npulse has joined #archiveteam-bs |
03:26
🔗
|
|
Petri152 has quit IRC (Ping timeout: 246 seconds) |
03:30
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
03:41
🔗
|
godane |
i'm looking at setting up a patreon page |
03:49
🔗
|
|
arkhive has joined #archiveteam-bs |
03:54
🔗
|
SketchCow |
Someone is archiving all of github |
03:54
🔗
|
SketchCow |
Thinks it'll be 500gb |
03:54
🔗
|
SketchCow |
I mean tb |
03:54
🔗
|
SketchCow |
Offers it to IA |
03:54
🔗
|
SketchCow |
I am going to dee-cline |
04:05
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
04:09
🔗
|
godane |
i know 500tb would be like $1 MILLION DOLLARS to host on IA |
04:10
🔗
|
godane |
it may be close to half of that price these days though |
04:18
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
04:19
🔗
|
|
odemg has joined #archiveteam-bs |
04:27
🔗
|
|
ZexaronS- has joined #archiveteam-bs |
04:27
🔗
|
|
ZexaronS has quit IRC (Ping timeout: 260 seconds) |
04:40
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:44
🔗
|
|
Petri152 has joined #archiveteam-bs |
04:46
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:57
🔗
|
jrwr |
Ya |
04:57
🔗
|
jrwr |
Unless it was all the issues + wiki content as well SketchCow, I wouldn't bother at all |
05:25
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
05:33
🔗
|
|
Soni has quit IRC (Ping timeout: 250 seconds) |
05:36
🔗
|
|
Soni has joined #archiveteam-bs |
05:50
🔗
|
hook54321 |
Someone mentioned a while ago here that they thought the torrent tracker Apollo was going to shut down, so I thought that I'd mention that the HTTPs tracker has been down for well over 24 hours now. |
05:59
🔗
|
godane |
i'm uploading one of my Power Rangers WOC tapes |
05:59
🔗
|
godane |
to FOS |
05:59
🔗
|
godane |
i'm off to bed |
05:59
🔗
|
godane |
SketchCow: please don't upload my 'Godane VHS Capture' folder files in the mean time |
06:00
🔗
|
godane |
my wifi could disconnect in my sleep |
06:00
🔗
|
godane |
bbl |
08:46
🔗
|
|
drumstick has quit IRC (Ping timeout: 370 seconds) |
08:46
🔗
|
|
drumstick has joined #archiveteam-bs |
09:44
🔗
|
|
BartoCH has joined #archiveteam-bs |
09:53
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
09:59
🔗
|
Somebody2 |
SketchCow: I'm glad someone is thinking and practicing grabbing all of github -- but yeah, I don't think mirroring it on IA at this point is a good idea. |
10:00
🔗
|
Somebody2 |
Now, if someone had 500TB of space, and wanted to use it to mirror half a petabyte of IA's stuff -- I think that would be welcome. |
10:01
🔗
|
Somebody2 |
And if the people archiving github were willing and able to filter their collection for material which had been *removed* from github, ... |
10:01
🔗
|
Somebody2 |
*that* material would seem eminently suitable for a (probably private) collection on IA. |
10:02
🔗
|
Somebody2 |
But that would also likely be only a couple TB, at most. |
10:02
🔗
|
Somebody2 |
(rant over) |
10:04
🔗
|
godane |
https://www.youtube.com/watch?v=HQ_3g2hUCn4 |
10:22
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
10:32
🔗
|
|
fie has joined #archiveteam-bs |
11:16
🔗
|
|
drumstick has quit IRC (Read error: Operation timed out) |
11:16
🔗
|
|
drumstick has joined #archiveteam-bs |
11:34
🔗
|
|
drumstick has quit IRC (Read error: Operation timed out) |
11:57
🔗
|
|
etudier has joined #archiveteam-bs |
12:20
🔗
|
|
Soni has quit IRC (Ping timeout: 190 seconds) |
12:26
🔗
|
|
Soni has joined #archiveteam-bs |
14:20
🔗
|
|
arkhive has quit IRC (Ping timeout: 255 seconds) |
14:42
🔗
|
|
dd0a13f37 has joined #archiveteam-bs |
14:43
🔗
|
dd0a13f37 |
Of the 500tb, how much is images and similar? It feels like an extreme case of the Pareto principle, they can't have 500tb of code |
14:44
🔗
|
dd0a13f37 |
If you strip away all files that are over 10mb and entropy>0.95, then deduplicate it, how much then? Can't be more than 1tb |
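A rough sketch of the filter described above, in Python. It assumes "entropy" means the Shannon entropy of the byte histogram scaled to the 0..1 range, that "10mb" means 10 MiB, and that a file is dropped only when it is both large and high-entropy. The function names are made up for illustration, and nothing here says how big the result would really be.

```python
import hashlib
import math
from collections import Counter

SIZE_LIMIT = 10 * 1024 * 1024  # "over 10mb", read here as 10 MiB (assumption)
ENTROPY_LIMIT = 0.95           # entropy normalized to 0..1 (assumption)


def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, divided by 8 bits so it lies in [0, 1]."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    bits = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return bits / 8.0


def keep_blob(data: bytes) -> bool:
    """Drop only blobs that are both large and look compressed or encrypted."""
    return not (len(data) > SIZE_LIMIT and normalized_entropy(data) > ENTROPY_LIMIT)


def dedup_size(blobs) -> int:
    """Total bytes left after filtering, counting each distinct SHA-256 once."""
    seen = {}
    for data in blobs:
        if keep_blob(data):
            seen.setdefault(hashlib.sha256(data).hexdigest(), len(data))
    return sum(seen.values())
```

Git already stores identical blobs once per repository, so a real run would mostly be measuring duplication across repositories (forks, vendored copies).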
14:45
🔗
|
dd0a13f37 |
I have a question for IA people- is it possible to get the wayback machine to always use a certain UA for certain sites? |
15:28
🔗
|
hook54321 |
dd0a13f37: I'm not an IA person, but a feature like that would probably be possible, but I don't think it exists right now. ArchiveBot can use other useragents though. |
15:28
🔗
|
dd0a13f37 |
Only a predefined list |
15:28
🔗
|
JAA |
... but only from a selected list of four UAs, not a custom string. |
15:49
🔗
|
hook54321 |
In most situations we don't need something outside of those four. |
15:50
🔗
|
dd0a13f37 |
Googlebot? |
15:58
🔗
|
hook54321 |
That could potentially be useful for some sites that use captchas, there's a potential ethical issue with using another bot's useragent though. |
15:59
🔗
|
dd0a13f37 |
That's the moral responsibility of the job submitter |
16:00
🔗
|
dd0a13f37 |
For captchas, some primitive ones can actually be cracked with today's software; nobody bothers cracking them properly since at small scales it's cheaper to hire someone to type them out |
16:00
🔗
|
dd0a13f37 |
Some sites hide content if you're not using googlebot UA |
16:08
🔗
|
joepie91_ |
dd0a13f37: that's reason for delisting from Google btw |
16:08
🔗
|
joepie91_ |
dd0a13f37: https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1 |
16:08
🔗
|
joepie91_ |
unsure how the exact submission process works |
16:09
🔗
|
joepie91_ |
but google forbids sites from serving different content to googlebot than to real agents |
16:09
🔗
|
joepie91_ |
(and they occasionally do tests with browser-like agents to verify this) |
16:09
🔗
|
hook54321 |
A good example of sites that do it are paywalled sites |
16:09
🔗
|
hook54321 |
Well, kinda |
16:10
🔗
|
hook54321 |
At least you can see their articles in Google Cache often |
16:15
🔗
|
joepie91_ |
also not allowed :) |
16:15
🔗
|
joepie91_ |
referer checking *is* allowed though |
16:31
🔗
|
|
dd0a13f37 has quit IRC (Ping timeout: 268 seconds) |
16:32
🔗
|
kisspunch |
Somebody2: That someone is me--I have looked at deduplicating and didn't make too much progress on random subsets, I'll keep looking. And yes, I'd be happy to do a search for removed stuff after I get a mirror, sounds like a good idea |
16:33
🔗
|
kisspunch |
Basically I've spent about a year trying to figure out how to do this efficiently/cheaper, and in that time 20%+ of my repo list has vanished |
16:33
🔗
|
kisspunch |
(a year of spare time on weekends, ya'know) |
16:34
🔗
|
kisspunch |
So I'm going to see if I can't just mirror now and reduce the size after |
16:34
🔗
|
kisspunch |
To reduce loss in the meantime |
16:37
🔗
|
kisspunch |
Thanks for the support everyone :) |
16:37
🔗
|
kisspunch |
No worries if IA can't host, just worth a try |
16:37
🔗
|
|
dd0a13f37 has joined #archiveteam-bs |
16:39
🔗
|
dd0a13f37 |
Thanks joepie91_, do you know if they'll actually punish them or just tell them to stop? |
16:40
🔗
|
dd0a13f37 |
ah fuck, you need a google account |
16:46
🔗
|
hook54321 |
It's like how on some social media services you need an account to be able to report content |
16:47
🔗
|
hook54321 |
In other words: Richard Stallman isn't able to report Facebook posts. |
16:50
🔗
|
|
refeed has joined #archiveteam-bs |
16:50
🔗
|
|
Odd0002 has quit IRC (Quit: ZNC - http://znc.in) |
16:50
🔗
|
dd0a13f37 |
Anyone here have a google account and want to report a site violating google's rules then? |
16:52
🔗
|
|
Odd0002 has joined #archiveteam-bs |
16:52
🔗
|
|
Frogging has quit IRC (Read error: Operation timed out) |
16:53
🔗
|
|
Frogging has joined #archiveteam-bs |
16:53
🔗
|
joepie91_ |
dd0a13f37: they'll be delisted until they fix their shit, last time I checked |
16:54
🔗
|
joepie91_ |
as in, not appear in search results at all |
16:54
🔗
|
dd0a13f37 |
But they'll get some kind of warning |
16:54
🔗
|
joepie91_ |
the general idea is that google doesn't want misleading listings |
16:54
🔗
|
dd0a13f37 |
Yes, but they won't be punished? |
16:54
🔗
|
joepie91_ |
no, just a delist and a notification in the search console |
16:54
🔗
|
joepie91_ |
yes? |
16:54
🔗
|
joepie91_ |
by being delisted |
16:54
🔗
|
joepie91_ |
lol |
16:54
🔗
|
dd0a13f37 |
Yes but if they get informed beforehand |
16:54
🔗
|
dd0a13f37 |
Or won't they? |
16:55
🔗
|
dd0a13f37 |
Also, what would be the effect? Can they still give google special treatment somehow? |
16:55
🔗
|
zino |
Delisting is basically the death penalty. There is no harder punishment I can think of. |
16:55
🔗
|
joepie91_ |
? |
16:55
🔗
|
dd0a13f37 |
Or will they be forced to show them the content behind a paywall like everyone gets now? |
16:55
🔗
|
zino |
dd0a13f37: They are fine if they special treat Google referers. |
16:55
🔗
|
dd0a13f37 |
Oh okay, so it won't change anything? |
16:55
🔗
|
joepie91_ |
dd0a13f37: I'm not sure what you're trying to get at - the result of misleading content is a delist, and that gets removed when the content is fixed |
16:55
🔗
|
joepie91_ |
that is all there is to it |
16:55
🔗
|
dd0a13f37 |
Yes, so it's not permanent |
16:56
🔗
|
dd0a13f37 |
No point in reporting them |
16:56
🔗
|
joepie91_ |
what |
16:56
🔗
|
dd0a13f37 |
OTOH, you could scrape with fake referrer |
16:56
🔗
|
dd0a13f37 |
oh well, have to go |
16:56
🔗
|
joepie91_ |
the point is to coerce sites into not doing that, so of course there's a point in reporting them |
16:56
🔗
|
dd0a13f37 |
svd.se if anyone wants to report |
16:56
🔗
|
joepie91_ |
because that means they have to fix-or-sink |
16:57
🔗
|
|
Smiley has quit IRC (Read error: Operation timed out) |
17:01
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
17:10
🔗
|
|
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
17:16
🔗
|
|
Smiley has joined #archiveteam-bs |
17:17
🔗
|
|
etudier has joined #archiveteam-bs |
17:21
🔗
|
|
dd0a13f37 has quit IRC (Ping timeout: 268 seconds) |
17:44
🔗
|
|
dd0a13f37 has joined #archiveteam-bs |
17:44
🔗
|
dd0a13f37 |
Sorry for being unclear. My point was, if I report them to google, can they substitute googlebot agent detection for anything else? For example, can they send google their articles so they can index them still |
17:45
🔗
|
dd0a13f37 |
Or will they be forced to disable paywall for anyone with google referrer? |
17:46
🔗
|
dd0a13f37 |
And, when they start complying with google's demands, will they suffer any penalty from having been delisted? Will they get an advance warning to fix their shit since they're a large site, or will they be delisted and then forced to fix it ASAP? |
17:49
🔗
|
|
RichardG_ has joined #archiveteam-bs |
17:49
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
17:51
🔗
|
joepie91_ |
dd0a13f37: the rule that google sets is that the content a user gets when clicking a search result on google must be the same (or equivalent) as the content that the googlebot saw; i.e. so long as it functions with a google referer, it's fine |
17:52
🔗
|
joepie91_ |
dd0a13f37: afaik delisting is immediate and automatic |
17:52
🔗
|
joepie91_ |
site doesn't matter |
17:52
🔗
|
joepie91_ |
idem for re-listing |
18:00
🔗
|
dd0a13f37 |
Well, it's a real shame I don't have a google account then |
18:02
🔗
|
Frogging |
I would say it's easy to sign up but nowadays they force you to give them a phone number |
18:02
🔗
|
dd0a13f37 |
Yes, and they don't allow Tor |
18:02
🔗
|
|
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
18:04
🔗
|
dd0a13f37 |
If anyone has an account and would like to report them, it would be very good - it's impossible to scrape them as it is right now since everything is behind a paywall |
18:07
🔗
|
dd0a13f37 |
An example URL of this selective behavior can be found at https://www.svd.se/fragan-om-manggifte-provar-tron-pa-det-egna-samhallet , change UA to googlebot and you'll get the whole page |
18:09
🔗
|
dd0a13f37 |
It even sets a cookie device-info=bot which usually is device-info=desktop |
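The behaviour described above (a different page depending on the user agent) could be checked mechanically. Below is a minimal Python sketch that compares the body a site returns for a browser-like UA with the body it returns for a crawler-like UA. The `fetch` callable, the `min_ratio` threshold and the browser UA string are assumptions for illustration. No real request is made here, so any HTTP client can be plugged in.

```python
from typing import Callable, Dict

# Illustrative header sets. The browser string is made up; the crawler string
# follows the commonly published Googlebot pattern.
BROWSER_HEADERS: Dict[str, str] = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
}
CRAWLER_HEADERS: Dict[str, str] = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

Fetch = Callable[[str, Dict[str, str]], str]


def looks_cloaked(url: str, fetch: Fetch, min_ratio: float = 2.0) -> bool:
    """Heuristic: flag the URL if the crawler body is much longer than the browser body.

    `fetch(url, headers)` is a caller-supplied function returning the response
    body as text; `min_ratio` is an arbitrary threshold, not a known constant.
    """
    browser_body = fetch(url, BROWSER_HEADERS)
    crawler_body = fetch(url, CRAWLER_HEADERS)
    return len(crawler_body) >= min_ratio * max(len(browser_body), 1)
```

A length ratio is a crude signal. Comparing extracted article text would be more robust, but needs a site-specific parser.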
18:14
🔗
|
refeed |
wew, it seems like they're more concerned with SE-bots rather than the users |
18:15
🔗
|
dd0a13f37 |
No, it's intentional |
18:15
🔗
|
HCross2 |
dd0a13f37: so if I spoof Googlebot.. I can get all the articles? |
18:15
🔗
|
dd0a13f37 |
It's a paywall, you need to pay $25 a month or something to read the articles |
18:15
🔗
|
dd0a13f37 |
Yes, exactly |
18:15
🔗
|
HCross2 |
HANG ON A MOMENT |
18:15
🔗
|
dd0a13f37 |
Until they fix it |
18:22
🔗
|
refeed |
okay |
18:22
🔗
|
refeed |
I think https://www.google.com/webmasters/tools/spamreportform is the right form to submit it |
18:23
🔗
|
refeed |
s/submit/report |
18:47
🔗
|
|
etudier has joined #archiveteam-bs |
18:53
🔗
|
|
VADemon has joined #archiveteam-bs |
19:14
🔗
|
|
schbirid2 has quit IRC (Ping timeout: 1208 seconds) |
19:16
🔗
|
|
refeed has quit IRC (Read error: Operation timed out) |
19:20
🔗
|
|
schbirid has joined #archiveteam-bs |
19:32
🔗
|
jrwr |
So, it IS a Bootleg, but I just got an out-of-print Anime as a gift from a co-worker on a DVD Set from 2000 |
19:33
🔗
|
jrwr |
im doing 1:1 copies of it now |
19:40
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
19:46
🔗
|
|
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
19:47
🔗
|
|
Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in) |
19:52
🔗
|
|
Xibalba has joined #archiveteam-bs |
19:53
🔗
|
|
etudier has joined #archiveteam-bs |
19:53
🔗
|
|
schbirid has joined #archiveteam-bs |
19:57
🔗
|
Somebody2 |
kisspunch: Good luck; glad for the suggestion of separating out removed stuff once a mirror is made. |
20:37
🔗
|
|
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
20:42
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
20:42
🔗
|
|
Mateon1 has joined #archiveteam-bs |
21:11
🔗
|
|
etudier has joined #archiveteam-bs |
21:21
🔗
|
|
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
21:25
🔗
|
|
BartoCH has quit IRC (Quit: WeeChat 1.9) |
21:27
🔗
|
|
etudier has joined #archiveteam-bs |
21:27
🔗
|
|
icedice has joined #archiveteam-bs |
21:27
🔗
|
|
icedice has left |
21:36
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:36
🔗
|
|
BartoCH has joined #archiveteam-bs |
21:41
🔗
|
|
BartoCH has quit IRC (Remote host closed the connection) |
22:17
🔗
|
|
BartoCH has joined #archiveteam-bs |
22:17
🔗
|
|
BartoCH has quit IRC (Remote host closed the connection) |
22:20
🔗
|
|
BartoCH has joined #archiveteam-bs |
22:23
🔗
|
|
drumstick has joined #archiveteam-bs |
22:38
🔗
|
|
pizzaiolo has quit IRC (Quit: pizzaiolo) |
22:43
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
22:51
🔗
|
jrwr |
found a thing http://dh.mundus.xyz/Lynda/ |
22:51
🔗
|
jrwr |
tons and tons of videos I think |
22:51
🔗
|
|
qwebirc18 has joined #archiveteam-bs |
22:52
🔗
|
JAA |
mundus: I guess that's yours? |
22:52
🔗
|
mundus |
yes |
22:52
🔗
|
|
dd0a13f37 has quit IRC (Ping timeout: 268 seconds) |
22:52
🔗
|
jrwr |
I had picked it up in my URL Logger |
22:53
🔗
|
jrwr |
didn't see where it came from |
22:53
🔗
|
jrwr |
too many damn IRC channels I'm in |
22:53
🔗
|
mundus |
#DataHoarder presumably |
22:53
🔗
|
jrwr |
Prolly |
22:53
🔗
|
jrwr |
I'm in 193 channels ATM |
22:53
🔗
|
|
Odd0002 has quit IRC (ZNC - http://znc.in) |
22:54
🔗
|
mundus |
that's all lynda courses as of 2 days ago |
22:54
🔗
|
mundus |
2.8TB |
22:55
🔗
|
mundus |
also just finished hacking each folder into a torrent |
22:55
🔗
|
mundus |
*hashing |
22:55
🔗
|
mundus |
but haven't added to a client yet |
22:56
🔗
|
qwebirc18 |
Are you sure nobody has ripped it before? Look through torrent indexes and see if you can avoid pointless splitting of seeders. |
22:56
🔗
|
qwebirc18 |
Anyone here ever heard about the ".vec" format? file doesn't identify it |
22:56
🔗
|
qwebirc18 |
https://front.e-pages.dk/data/Sun59c6e5cd1de35/dagen/620/vector/42.vec |
22:57
🔗
|
qwebirc18 |
First 70 chars: 0#0#1024#1449!S4e4b4cBM034c577cL037a577cL037a5693L03515693L031c56afL03 |
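For anyone poking at the sample above, here is a small Python sketch that only splits the string into pieces. It assumes (a guess, nothing in the log confirms it) that the part before `!` is a `#`-separated header and that each uppercase letter after it starts a command followed by lowercase hex digits. It does not try to say what the commands or numbers mean.

```python
import re

# The 70 characters quoted in the channel; the sample is cut off mid-command.
SAMPLE = "0#0#1024#1449!S4e4b4cBM034c577cL037a577cL037a5693L03515693L031c56afL03"


def tokenize_vec(text: str):
    """Split into (header_fields, commands) under the guessed layout.

    Each command is a (letter, hex_payload) tuple; the payload may be empty.
    """
    header, _, body = text.partition("!")
    fields = header.split("#")
    commands = re.findall(r"([A-Z])([0-9a-f]*)", body)
    return fields, commands


fields, commands = tokenize_vec(SAMPLE)
for letter, payload in commands:
    print(letter, payload)
```

The last token comes out as `L` with payload `03` because the quoted sample is truncated, not because the command is short.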
22:58
🔗
|
|
qwebirc18 is now known as dd0a13f37 |
23:00
🔗
|
JAA |
Fuck LinkedIn. When you access a page with a UA that isn't detected as a browser, they respond with HTTP status "999 Request denied". Because obviously everyone should be making up their own status codes instead of using 4xx. |
23:00
🔗
|
jrwr |
holy shit |
23:00
🔗
|
jrwr |
they really use 999 |
23:00
🔗
|
jrwr |
WTF |
23:00
🔗
|
JAA |
For example, try: curl -v https://www.linkedin.com/in/nmsanchez |
23:01
🔗
|
mundus |
wtf |
23:01
🔗
|
astrid |
geocities used that exact same code to say "fuck off you're ratelimited" |
23:01
🔗
|
jrwr |
haha |
23:01
🔗
|
jrwr |
BATTLE MODE 0999 |
23:02
🔗
|
|
qwebirc78 has joined #archiveteam-bs |
23:02
🔗
|
jrwr |
god I'm a nerd |
23:02
🔗
|
astrid |
i hate nerds |
23:02
🔗
|
qwebirc78 |
Linkedin blacklisting tor is an order of magnitude worse |
23:02
🔗
|
jrwr |
Well it's understandable |
23:02
🔗
|
JAA |
Here's another weird one: http://og.infg.com.br/in/18932010-7b0-a07/FT1086A/420/MAPA-VIOLENCIA.png returns HTTP 750. |
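Codes like 999 and 750 fall outside the five classes (1xx to 5xx) that HTTP defines, so a crawler has to decide for itself what they mean. A minimal Python sketch of bucketing a status code follows. The function name and the bucket labels are made up for illustration, and how a grab should react to a "nonstandard" code is left open.

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code to its class, or 'nonstandard' if it has none."""
    if 100 <= code <= 199:
        return "informational"
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "redirect"
    if 400 <= code <= 499:
        return "client_error"
    if 500 <= code <= 599:
        return "server_error"
    # Anything else (e.g. 750 or 999) is outside the defined classes.
    return "nonstandard"


for code in (200, 404, 750, 999):
    print(code, classify_status(code))
```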
23:03
🔗
|
qwebirc78 |
No, it isn't. What good reason is there to block them from reading? |
23:03
🔗
|
|
dd0a13f37 has quit IRC (Ping timeout: 268 seconds) |
23:03
🔗
|
qwebirc78 |
I can understand creating accounts since it's not anonymous anyway |
23:03
🔗
|
|
qwebirc78 is now known as dd0a13f37 |
23:03
🔗
|
dd0a13f37 |
But there's no point in blocking people from browsing |
23:05
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:19
🔗
|
jrwr |
dd0a13f37: cuts down on the WAF log spam? |
23:20
🔗
|
dd0a13f37 |
I don't think they look through their logs manually |
23:20
🔗
|
dd0a13f37 |
they either just write them to /dev/null or automate it |
23:21
🔗
|
dd0a13f37 |
And they usually use proxies, not tor |
23:27
🔗
|
jrwr |
gotta get my proxy list and a good proxy judge setup |
23:29
🔗
|
dd0a13f37 |
They're all blacklisted to hell and back |
23:31
🔗
|
dd0a13f37 |
well this is fucking retarded |
23:31
🔗
|
dd0a13f37 |
I register on a site, enter sharklasers email, register fine |
23:32
🔗
|
dd0a13f37 |
go to enter password page, "username has illegal character" |
23:32
🔗
|
dd0a13f37 |
bravo |