00:03 *** qwebirc57 has joined #archiveteam-bs
00:04 <qwebirc57> unstable fucking piece of shit
00:04 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
00:04 *** qwebirc57 is now known as dd0a13f37
00:04 *** Honno has quit IRC (Read error: Operation timed out)
00:07 <dd0a13f37> I missed anything?
00:14 *** dd0a13f3T has joined #archiveteam-bs
00:15 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
00:18 <JAA> Nah
00:21 *** refeed has joined #archiveteam-bs
00:22 *** dd0a13f3T is now known as dd0a13f37
00:43 *** drumstick has quit IRC (Read error: Operation timed out)
00:44 <dd0a13f37> If something is on usenet, is it considered archived? And would it be a good idea to upload library genesis torrents to archive.org, or would that be considered wasting space/bandwidth for piracy?
00:53 <JAA> I've heard that there might be a copy of libgen at IA already (but not publicly available). Not sure if it's true though.
00:54 <JAA> And although Usenet is safe-ish, I wouldn't consider it archived. Stuff still disappears from it sooner or later.
00:54 <dd0a13f37> You can upload a torrent to IA and have them download it, right?
00:54 <JAA> Yes, I believe so.
00:54 <dd0a13f37> Then you could download their zip file of torrents, upload them to archive.org, then wait for them to pull it
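
[Editor's note] For context on the workflow being proposed here, a minimal sketch using the internetarchive Python library (pip install internetarchive, then "ia configure" for credentials). The item identifier, file name, and metadata are hypothetical examples, and whether IA actually pulls the torrent's content into the item is IA-side behaviour, not something the API guarantees:

    from internetarchive import upload

    # Upload a .torrent file into a (hypothetical) new item; IA can then
    # fetch the torrent's content itself on their end.
    responses = upload(
        'libgen-torrents-example',    # hypothetical item identifier
        files=['r_000.torrent'],      # hypothetical torrent file from the zip
        metadata={'mediatype': 'data', 'title': 'Library Genesis torrents (example)'},
    )
    print(responses[0].status_code)   # 200 on success
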
00:54 <dd0a13f37> But is it worth it? It's 30 TB of data, and it will likely be hidden
00:56 <dd0a13f37> The databases are archived
00:56 <dd0a13f37> https://archive.org/details/libgen-meta-20150824
00:57 *** dd0a13f3 has joined #archiveteam-bs
00:57 *** BlueMaxim has joined #archiveteam-bs
01:01 <JAA> I wouldn't be surprised if either https://archive.org/details/librarygenesis or https://archive.org/details/gen-lib contained a full (hidden) archive.
01:01 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
01:04 <dd0a13f3> Should I avoid uploading it, or will it recognize and deduplicate?
01:05 *** dd0a13f3 is now known as dd0a13f37
01:08 <dd0a13f37> both of these are 3 years old, so they're outdated at any rate
01:08 <godane> so i'm going through my web archives that i have not uploaded
01:09 <godane> or at least thought i uploaded and turned out i didn't
01:10 <dd0a13f37> Okay, so if I have a URL pointing to a zip file of torrents, can I just give them the URL?
01:12 <dd0a13f37> No, apparently not. How does this "derive" stuff work, can I have them unpack a zip file for me?
01:17 <JAA> dd0a13f37: That's when the collection was created, not when any items in the collection were added/last updated.
01:17 <JAA> By the way, the graph for the number of items in the second collection of the two looks interesting...
01:17 <dd0a13f37> Sure, but who would update such a collection?
01:18 <JAA> Someone from IA?
01:18 <dd0a13f37> 2k items is much too small, they have 2 million books. Or is it the number of folders?
01:20 <JAA> An item can hold an arbitrary number of directories and files (more or less, there seem to be some issues if the items get very large).
01:21 <JAA> If they have a copy, they certainly wouldn't throw it all into one item, and they also certainly wouldn't throw each book/article into its own item.
01:21 <dd0a13f37> The torrents are folders named XXXX000, where XXXX is the unique identifier (from 0-2092)
01:21 <JAA> Well, then 2k sounds about right?
01:21 <dd0a13f37> So that could mean there are 2k different folders
01:21 <dd0a13f37> Yeah
01:22 <dd0a13f37> Although, looking at the graph it seems more like 1.4k, or is it log?
01:24 * JAA shrugs
01:25 <JAA> Looks like it might be rounded, so the top of the graph is 1.5k.
01:25 <godane> i'm reuploading my images.g4tv.com dumps
01:26 <dd0a13f37> Should I upload them again then?
01:26 <dd0a13f37> They're also missing sci-mag, which is around 50 TB
01:26 <JAA> Definitely ask IA about this first.
01:27 <JAA> But I doubt that that dataset is going to disappear anytime soon.
01:27 <JAA> There are certainly several copies stored in various places.
01:27 <JAA> (Including the ones publicly available via Usenet or torrents.)
01:28 <dd0a13f37> Yes, that's true. The torrents are seeded, and various mirrors have more or less complete copies.
01:30 <godane> looks like i uploaded them, nevermind
01:32 <dd0a13f37> Sci-mag is worse off, but on the other hand they have sci-hub, which has multiple servers run by people who are not subject to any jurisdiction
01:32 <dd0a13f37> So both collections should be fine
01:51 *** drumstick has joined #archiveteam-bs
02:49 *** VADemon_ has quit IRC (left4dead)
02:57 <hook54321> Should I check if a piece of software is already on archive.org before going through all my CDs?
03:06 <dd0a13f37> To upload or to download?
03:07 <dd0a13f37> If they're somehow part of a collection then it might not be such a huge deal
03:21 <hook54321> What do you mean?
03:24 <dd0a13f37> If you have some collection of software on 10 different disks that you bought as a bundle then it might have historical value as a whole even if all the software exists separately
03:49 <hook54321> it's mostly single disks, bought separately.
03:57 <dd0a13f37> Well, it can't be that much storage wasted even if you do upload it twice
03:57 <dd0a13f37> could be different versions as well
03:58 <hook54321> If it has a different cover then I would definitely upload it
04:02 *** drumstick has quit IRC (Read error: Operation timed out)
04:04 *** drumstick has joined #archiveteam-bs
04:28 <hook54321> arkiver: I left the channel
04:46 *** Sk1d has quit IRC (Ping timeout: 194 seconds)
04:52 *** Sk1d has joined #archiveteam-bs
04:59 *** refeed has quit IRC (Ping timeout: 600 seconds)
05:33 *** pizzaiolo has quit IRC (Quit: pizzaiolo)
05:33 *** refeed has joined #archiveteam-bs
06:05 *** icedice has quit IRC (Quit: Leaving)
06:07 *** Dimtree has quit IRC (Read error: Operation timed out)
06:57 <hook54321> Did we grab all the duckduckgo stuff?
07:01 *** Dimtree has joined #archiveteam-bs
07:21 *** Soni has quit IRC (Ping timeout: 272 seconds)
07:28 *** Stilett0 has joined #archiveteam-bs
07:30 *** DFJustin has quit IRC (Remote host closed the connection)
07:34 *** DFJustin has joined #archiveteam-bs
07:34 *** swebb sets mode: +o DFJustin
08:17 *** Asparagir has quit IRC (Asparagir)
08:25 *** kristian_ has joined #archiveteam-bs
08:37 *** Honno has joined #archiveteam-bs
08:52 *** kristian_ has quit IRC (Quit: Leaving)
09:24 *** schbirid has joined #archiveteam-bs
09:27 *** refeed has quit IRC (Read error: Operation timed out)
09:35 *** tuluu has quit IRC (Read error: Operation timed out)
09:52 *** underscor has joined #archiveteam-bs
09:52 *** swebb sets mode: +o underscor
10:02 *** tuluu has joined #archiveteam-bs
10:15 *** BartoCH has joined #archiveteam-bs
10:29 *** zhongfu_ has quit IRC (Ping timeout: 260 seconds)
10:29 *** zhongfu has joined #archiveteam-bs
10:44 *** Mateon1 has quit IRC (Read error: Operation timed out)
10:44 *** Mateon1 has joined #archiveteam-bs
11:00 *** noirscape has joined #archiveteam-bs
11:09 *** BlueMaxim has quit IRC (Quit: Leaving)
11:13 *** drumstick has quit IRC (Read error: Operation timed out)
11:19 <joepie91_> hook54321: definitely upload it; if it turns out to be a duplicate it can always be removed later
11:19 <joepie91_> hook54321: there are often many different editions of the same thing
11:26 *** Soni has joined #archiveteam-bs
11:36 *** pizzaiolo has joined #archiveteam-bs
11:48 *** tuluu_ has joined #archiveteam-bs
11:49 *** tuluu has quit IRC (Read error: Operation timed out)
12:14 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
12:33 <JAA> http://www.instructables.com/id/How-to-fix-a-Samsung-external-m3-hard-drive-in-und/ :-)
13:19 *** wp494 has quit IRC (Read error: Connection reset by peer)
13:20 *** wp494 has joined #archiveteam-bs
14:11 *** schbirid has quit IRC (Quit: Leaving)
14:17 *** etudier has joined #archiveteam-bs
14:21 *** Stilett0 has quit IRC (Read error: Operation timed out)
15:19 *** etudier has quit IRC (Remote host closed the connection)
15:26 <second> They say archive.org did a faulty job of archiving something, but they have the new forums up, can you guys archive their backup? http://gamehacking.org/ Scroll down to news for Aug 10th
15:26 <second> Or I can archive it, but where do I upload it to get it into the archive, and what is the proper way to do so?
15:31 <JAA> second: Is GameHacking itself also in danger, or is this just about the WiiRd forum archive?
15:32 <JAA> Whatever. GH isn't that big anyway. I'll throw it into ArchiveBot.
15:35 <JAA> Scratch the "not that big", but it's worth archiving the entire thing. Looks like it has tons of useful resources.
15:39 *** mls has quit IRC (Read error: Connection reset by peer)
15:40 *** mls has joined #archiveteam-bs
15:56 <second> JAA: just the WiiRd forum
15:56 <second> JAA: you're going to have a hard time archiving the gamehacking parts though
15:57 <second> Lots of JavaScript on the page. I was doing it, but headless Chrome crashed with the setup I was using in Docker w/ warcproxy
15:57 <second> I'll redo it when I get some time, and hopefully when headless Firefox comes out
15:57 <second> I have a Jupyter notebook with the code for doing it
15:58 <second> going through each page of the manuals and clicking expand
15:58 <second> If you can archive the other stuff / whatever you can, that would be great, because I'm only going for the cheat codes
15:58 <second> Very useful for emulators / games old and new
15:59 <second> There are some games which are pretty much unplayable without cheat codes because they required certain hardware things
15:59 <second> Think Pokemon trading to evolve, or Django the Solar Boy requiring the literal sun
15:59 <JAA> Hm, I haven't found anything that didn't work for me without JavaScript yet.
16:00 <JAA> Do you have an example?
16:02 <second> http://gamehacking.org/game/4366
16:02 <second> Click the down arrows on the side
16:02 <JAA> Ah yeah, just saw that now.
16:02 <second> They require JavaScript and output the codes for each cheat device
16:03 <second> Even includes notes
16:03 <second> It's too bad ArchiveBot can't accept JavaScript to run on each page, or something like Selenium commands, but ArchiveBot doesn't even work like that from what I gather
16:03 <second> It's more like a distributed wget
16:04 <second> perhaps one day it can be upgraded to a very light and small browser, or even a proxy that an archiving browser uses to hit pages
16:04 <second> Still, a partial archive is better than no archive
16:04 <second> JAA: is there an archive of allrecipes?
16:05 <second> And are you adding gamehacking.org to the archive?
16:05 <JAA> ArchiveBot does have PhantomJS, but that doesn't work too well and wouldn't help in this case at all.
16:06 <JAA> Or to be precise, wpull supports PhantomJS, and ArchiveBot uses wpull internally.
16:06 <second> wpull hasn't been updated in the longest!
16:06 <second> And isn't taking pull requests either
16:06 <JAA> But that's just for scrolling and loading scripted stuff. It doesn't work for clicking on things etc.
16:06 <JAA> Yes, I know. chfoo's been pretty busy, from what I gathered.
16:07 <second> Is there a more updated version, and does it work with youtube-dl now / still?
16:07 <second> hmm they are actually in here
16:07 <JAA> I know that youtube-dl is broken on at least most pipelines.
16:07 <second> They could try giving permissions for others to merge code in or push to the project
16:08 <JAA> No idea if it works when used directly with wpull.
16:09 <JAA> There's the fork by FalconK, which has a few bug fixes, but other than that I'm not aware of anyone working on it.
16:09 <JAA> I've been working on URL prioritisation for a while now, but I haven't spent much time on it really.
16:09 <JAA> FalconK's also pretty busy currently, so yeah, nobody's even trying to maintain it.
16:11 <second> URL prioritisation?
16:12 <second> What is everyone busy with?
16:13 <second> Is there a good way to save Wikia websites?
16:13 <second> So I have a lot of questions; it's not often I'm on EFnet (maybe I'll fix that) and I've been interested in archiving for a long time
16:14 <JAA> https://gist.github.com/JustAnotherArchivist/b82f7848e3c14eaf7717b9bd3ff8321a
16:14 <JAA> This is what I wrote a while ago about my plans.
16:14 <JAA> It's semi-implemented, but there's still some stuff to do, in particular there is no plugin interface yet, which is necessary to then implement it into ArchiveBot (and grab-site).
16:15 <JAA> People are busy with real-life stuff, I guess.
16:16 <JAA> Wikia's just MediaWiki, isn't it? There are two ways to save that, either through WikiTeam (no idea how active that is) or through ArchiveBot.
16:16 <second> Can ArchiveBot archive a flaky site which requires login?
16:17 <JAA> And regarding your earlier questions: there is no record of an archive of allrecipes in ArchiveBot; someone shared a dump in here a few months ago, but that's not a proper archive and can't be included in the Wayback Machine.
16:18 <JAA> Yes, I added gamehacking.org to ArchiveBot.
16:18 <second> Yeah, I found that one
16:18 <JAA> No, login isn't supported by ArchiveBot.
16:18 <JAA> Neither is CloudFlare DDoS protection and stuff like that, by the way.
16:19 <second> dang, did not know about CloudFlare
16:19 <second> Why not CloudFlare?
16:19 <second> That is a lot of sites we can't archive then
16:19 <JAA> Just the DDoS protection bit, i.e. the "Checking your browser" message thingy.
16:20 <JAA> That requires you to solve a JS challenge...
16:25 <JAA> There was some discussion on this in here a few days ago.
16:27 <second> https://github.com/ArchiveTeam/ArchiveBot/issues/216
16:28 <JAA> Yes, but cloudflare-scrape is a really shitty and insecure solution.
16:28 <JAA> second: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-14,Thu&sel=124-150#l120
16:29 *** brayden has quit IRC (Read error: Connection reset by peer)
16:29 *** brayden has joined #archiveteam-bs
16:36 *** swebb sets mode: +o brayden
16:40 *** cf has quit IRC (Ping timeout: 260 seconds)
16:51 *** cf has joined #archiveteam-bs
17:24 *** etudier has joined #archiveteam-bs
17:27 *** Stilett0- has joined #archiveteam-bs
17:41 *** Stilett0- is now known as Stiletto
17:44 <chfoo> i haven't been feeling like maintaining wpull unfortunately :/ it became a big ball of code
17:46 *** kristian_ has joined #archiveteam-bs
17:46 *** dd0a13f37 has joined #archiveteam-bs
17:48 <dd0a13f37> JAA: CloudFlare whitelists Tor using some strange voodoo magic (it's not just the user agent, and it works without JS), can we utilize this somehow?
17:53 <dd0a13f37> Or, well, it depends on the protection level, but for 90% of sites you can browse with Tor. It didn't use to be this way, and if you do "copy as cURL" from dev tools and paste into a terminal w/ torsocks you still get the warning page
17:53 <JAA> dd0a13f37: Interesting. If we knew more about it, we could perhaps use it, yes. I wonder how reliable it is though.
17:53 <dd0a13f37> It could be details in how SSL is handled
17:53 <dd0a13f37> That seems like the only difference I can think of
17:54 <JAA> That would be painful to replicate.
17:54 *** balrog has quit IRC (Ping timeout: 1208 seconds)
17:54 <JAA> I guess implementing joepie91_'s code in a wpull plugin is probably easier.
17:54 <dd0a13f37> Even if you do "new circuit for this site" and issue the request with a cookie that shouldn't be valid for that IP, it still works
17:54 <JAA> How do you get that cookie initially?
17:54 <dd0a13f37> Can't you just add a hook to get a valid cookie without changing any structure?
17:55 <dd0a13f37> The site sets it
17:55 <JAA> Hm
17:55 <dd0a13f37> You get a __cfduid cookie
17:55 <dd0a13f37> when connecting to a CF site
17:55 <JAA> So the normal procedure, right.
17:55 <dd0a13f37> Are those tied to IPs?
17:56 <JAA> Yeah, you could implement it as a hook, but the problem is that there is no proper implementation of a bypass.
17:56 <dd0a13f37> Because if I copy the exact request and issue it with curl (same cookies, headers, UA) using torsocks, it doesn't work
17:56 <dd0a13f37> That's the spooky thing
17:56 <dd0a13f37> What do you want to bypass? "One more step" or "please turn on JS"?
17:57 <JAA> "Checking your browser"
17:57 <dd0a13f37> Isn't there?
17:57 <JAA> Which is "please turn on JavaScript" if you have JS disabled.
17:57 <JAA> Not as far as I know.
17:57 <dd0a13f37> So what does joepie91's code do?
17:57 *** balrog has joined #archiveteam-bs
17:58 *** swebb sets mode: +o balrog
17:58 <JAA> It parses the challenge and calculates the correct response without executing JavaScript.
17:58 <dd0a13f37> Isn't that a bypass?
17:58 <dd0a13f37> Or what exactly are you looking to do?
17:59 <JAA> Yes, it is.
17:59 <JAA> But it's written in JavaScript, not in Python.
17:59 <JAA> https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
17:59 <dd0a13f37> Modify it so it prints the cookie to stdout, then just do a shell exec
18:00 <dd0a13f37> easy solution
18:00 <JAA> Yeah, we'd like a pure-Python version so we can avoid installing NodeJS or equivalent.
18:00 <JAA> I mean, it might work on ArchiveBot where we have PhantomJS anyway, but it'd also be nice to have it in the warrior, for example.
18:00 <dd0a13f37> Can't you set it up as a web service? Send challenge page, get response
18:00 <dd0a13f37> You only need to do it once
18:01 <JAA> Huh, that's a nice idea actually.
18:01 <JAA> A CF protection cracker API :-)
18:01 <dd0a13f37> """protection"""
18:01 <dd0a13f37> """cracker"""
18:01 <JAA> Hehe
18:02 <dd0a13f37> And what about https://github.com/Anorov/cloudflare-scrape ?
18:02 <JAA> That executes CF's code in NodeJS and is inherently insecure.
18:02 <dd0a13f37> So it needs Node?
18:02 <JAA> You can easily trick it into executing arbitrary code, i.e. use it for RCE.
18:02 <JAA> Yep
18:04 <dd0a13f37> Oh ok
18:07 <dd0a13f37> So how does the script work, does it take an entire page and return a cookie?
18:07 <JAA> Which script?
18:09 <dd0a13f37> https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
18:10 <JAA> I'm not sure. I've never used it, and I'm not familiar with using JavaScript like that (i.e. outside of a browser) at all.
18:10 <dd0a13f37> Me neither
18:10 <dd0a13f37> What is executed first? Or is it like a library, so you should look at the exports?
18:11 <JAA> As far as I can tell, the function in index.js takes the challenge site as an HTML string as the argument and throws out the relevant parts of the JS challenge that you need to combine somehow to get the response.
18:11 <JAA> The challenge looks like this, in case you're not familiar with it:
18:11 <JAA> fVbMmUH={"twaBkDiNOR":+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))};
18:12 <JAA> fVbMmUH.twaBkDiNOR-=+((+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));fVbMmUH.twaBkDiNOR*=+((!+[]+!![]+!![]+[])+...
18:12 <JAA> So you need to transform each of those JSFuck-like expressions into a number and then -=, *=, etc. those numbers to get the correct response.
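
[Editor's note] To illustrate the decoding step JAA describes, here is a toy Python sketch. It only handles the numeric JSFuck subset shown above (digit groups built from +[], !+[] and !![]); it is not joepie91_'s implementation, and a real solver would still have to chain the -=, *=, etc. operations mentioned above:

    import re

    def digit(group: str) -> int:
        # In JS, +[] is 0 and each truthy term (!+[] or !![]) adds 1, so a
        # parenthesised group evaluates to the count of those terms.
        return len(re.findall(r'!\+\[\]|!!\[\]', group))

    def decode_number(expr: str) -> int:
        # +((A)+(B)+...) turns each innermost group into one digit string,
        # concatenates them, and coerces the result back to a number.
        groups = re.findall(r'\(([^()]+)\)', expr)
        return int(''.join(str(digit(g)) for g in groups))

    # The first expression above decodes to 28:
    # decode_number('+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))') == 28
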
18:12 <dd0a13f37> Can't you just use a regex to sanitize it and then execute them unsafely?
18:12 <JAA> Hahaha, good luck sanitising JSFuck.
18:13 <JAA> I think cloudflare-scrape tries, but yeah...
18:13 <dd0a13f37> Oh, it can execute code, not just return a value?
18:13 <dd0a13f37> well then you're fucked
18:14 <JAA> Yeah. The code would be huge, but you can write *any* JS script with just the six characters ()[]+! used in the challenge.
18:15 <JAA> https://en.wikipedia.org/wiki/JSFuck
18:15 <dd0a13f37> Was that an actual example or just randomly generated?
18:16 <JAA> That's an actual example.
18:16 <dd0a13f37> Where can I find one?
18:16 <dd0a13f37> A complete one
18:18 <JAA> https://gist.github.com/anonymous/85c9b2b57726135a2500a8425b370095
18:23 <dd0a13f37> I don't understand the purpose
18:24 <dd0a13f37> Anyone who wants to do evil stuff would just use one of those scripts, and they're using a botnet so they wouldn't care about CloudFlare infecting them
18:24 <dd0a13f37> What's the point?
18:24 <JAA> Idk either
18:26 *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
18:28 <dd0a13f37> I don't get it, why can't you just use proxies for the really unfriendly sites?
18:28 *** Asparagir has joined #archiveteam-bs
18:29 <JAA> And by the way, it's not just about CloudFlare serving evil code. Anyone could easily trigger cloudflare-scrape from their own server with an appropriate response.
18:29 *** svchfoo3 sets mode: +o Asparagir
18:29 *** svchfoo1 sets mode: +o Asparagir
18:29 <dd0a13f37> Well, I doubt you care about ACE when running a botnet
18:30 <JAA> Specifically: https://github.com/Anorov/cloudflare-scrape/blob/ee17a7a145990d6975de0be8d8bf5b0abbd87162/cfscrape/__init__.py#L41-L47
18:30 <JAA> Yeah, I just mean in general.
18:31 <dd0a13f37> There are commercial proxy providers with clean IPs; the cost of renting a bunch would probably be cheaper than what you spend on hard drives
18:34 <dd0a13f37> Got another response from itorrents, he said he would upload the database to archive.org and send a link, the other three still haven't responded
18:42 <dd0a13f37> JAA: Looking at generated JSFuck code, it's usually very long
18:43 <dd0a13f37> CF's is quite short
18:43 <dd0a13f37> so you should be able to use a regex and limit the length
18:44 <dd0a13f37> for example, encoding the character "a" is 846 chars encoded
18:45 <dd0a13f37> http://www.jsfuck.com/
18:47 <dd0a13f37> And CF's brackets are always empty - [], JSFuck needs to have something inside to eval
18:48 <JAA> Yeah, I'm aware of that. It's still sloppy though.
18:49 <dd0a13f37> It should be safe though
18:49 <JAA> I don't think you strictly need something inside the brackets to do things in JSFuck, but it probably helps shorten the obfuscated code.
18:50 <dd0a13f37> You can never get the eval() you need to do bad things
18:50 <dd0a13f37> It shouldn't be Turing-complete
18:53 <JAA> Possible
18:53 <JAA> I don't really know enough about JSFuck to say for sure.
18:57 *** arkhive has joined #archiveteam-bs
18:57 <dd0a13f37> https://esolangs.org/wiki/JSFuck
18:58 <dd0a13f37> it needs a big blob which is not possible to encode in under a certain number of characters; it's ugly as fuck, but it should be safe
18:59 <dd0a13f37> the eval blob is 831 characters, so if you set an upper limit at 200 you should be fine
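
[Editor's note] The guard dd0a13f37 is sketching might look like this in Python. The 200-character cutoff is the one proposed above; whether a charset-plus-length check like this is actually airtight is precisely the open question in this conversation, so treat it as an illustration rather than a vetted sanitiser:

    import re

    # Only the characters the short numeric CF expressions are built from.
    ALLOWED = re.compile(r'^[+\-*/=!()\[\]]+$')

    def probably_harmless(expr: str) -> bool:
        # Reject anything long enough to smuggle in a JSFuck eval-style blob
        # (reportedly ~831 chars at minimum) or using any other characters.
        return len(expr) <= 200 and bool(ALLOWED.match(expr))
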
19:02 *** etudier has joined #archiveteam-bs
19:06 *** etudier has quit IRC (Client Quit)
19:06 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
19:07 <mundus> What's the best tool for large site archival?
19:07 *** arkhive has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
19:22 <JAA> mundus: Define "large"?
19:23 <mundus> like a million pages
19:23 <JAA> wpull can handle that easily, assuming you have sufficient disk space.
19:23 <mundus> Okay
19:23 <mundus> I was guessing wpull
19:24 <JAA> Not sure if it's the "best" tool, but it works well.
19:24 <JAA> I've run multi-million URL archivals with wpull several times.
19:24 <mundus> alright, what options do you normally use?
19:25 <JAA> I think I mostly copied those used in ArchiveBot, then adapted them a bit in some cases.
19:26 <JAA> https://github.com/ArchiveTeam/ArchiveBot/blob/a6e6da8ba37e733e4b10b7090b5fc4a6cffc9119/pipeline/archivebot/seesaw/wpull.py#L18-L53
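
[Editor's note] For readers wanting a concrete starting point, a stripped-down wpull invocation in that spirit could look like the following. These are common wget/wpull-style flags rather than the exact ArchiveBot configuration from the link above, and the URL and file names are placeholders:

    wpull 'https://example.com/' \
        --recursive --level inf \
        --page-requisites \
        --span-hosts-allow page-requisites,linked-pages \
        --warc-file example.com-crawl \
        --database example.com-crawl.db \
        --wait 0.5 --tries 3

Here --warc-file writes the crawl into a WARC, and --database keeps crawl state on disk so a multi-million-URL job can be interrupted and resumed.
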
19:26 <mundus> cool, thanks
19:35 <joepie91_> mundus: you may find grab-site useful also
19:35 <joepie91_> sort of like a local ArchiveBot
19:35 <joepie91_> mundus: ref https://github.com/ludios/grab-site
19:36 <mundus> oh nice
19:47 <second> chfoo: do you have a doc explaining how wpull works with youtube-dl etc, or how it should work?
19:55 <second> How do I become a member of the ArchiveTeam and what would that mean?
19:58 <second> JAA: is there a doc somewhere on how the IA archives things and keeps backups?
19:59 *** etudier has joined #archiveteam-bs
20:02 *** BartoCH has quit IRC (Ping timeout: 260 seconds)
20:04 <JAA> second: You become a member by doing stuff that aligns with AT's activities. There isn't anything formal.
20:06 <JAA> There is some stuff in the "help" section of archive.org, and also some blog entries. Not sure what else exists.
20:07 <JAA> I don't think the individual archival strategies etc. are documented well (publicly) though.
20:12 *** BartoCH has joined #archiveteam-bs
20:21 <jrwr> second: anyone can do /something/, we are more of a method than anything, what do you want to do?
20:26 *** kristian_ has quit IRC (Remote host closed the connection)
20:26 <second> not sure, I'm more working on file categorization / curation right now
20:27 <second> What kind of things shouldn't we archive?
20:28 <jrwr> Well
20:28 <jrwr> That's a hard question
20:29 <jrwr> If you are doing web archival, I would make sure to save everything as WARCs
20:29 <jrwr> (wget supports this, so does wpull)
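
[Editor's note] A minimal example of WARC output with stock GNU wget; the URL and WARC prefix are placeholders. --warc-file enables WARC writing and --warc-cdx adds a CDX index alongside it:

    wget --recursive --level=inf --page-requisites \
         --warc-file=example-site --warc-cdx \
         'https://example.com/'
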
20:30 <jrwr> Anything else, just do the best quality you can. The more metadata, the better
20:30 <jrwr> make an account on IA and go to town uploading things
20:31 <jrwr> check out SketchCow's IA and see how he uploads things
20:31 <jrwr> (for things like CDs, tapes, paper)
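
[Editor's note] One concrete way to do such uploads is the ia command-line tool that ships with the internetarchive Python package; the identifier, file, and metadata below are made-up examples:

    # pip install internetarchive && ia configure
    ia upload example-software-cd-1998 cd-image.iso \
        --metadata="mediatype:software" \
        --metadata="title:Example Software CD (1998)"
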
20:58 *** DFJustin has quit IRC (Remote host closed the connection)
21:08 *** DFJustin has joined #archiveteam-bs
21:08 *** swebb sets mode: +o DFJustin
21:25 *** ZexaronS has quit IRC (Quit: Leaving)
22:20 *** drumstick has joined #archiveteam-bs
22:25 *** Honno has quit IRC (Read error: Operation timed out)
22:30 *** Soni has quit IRC (Ping timeout: 506 seconds)
22:41 *** Soni has joined #archiveteam-bs
22:41 <second> Does the Internet Archive have deduplication active?
22:41 <second> I wouldn't want to upload a bunch of stuff and waste their space
22:41 *** ZexaronS has joined #archiveteam-bs
22:43 <second> JAA: has this been archived? https://www.reddit.com/r/opendirectories/comments/6zuk7v/alexandria_library_38029_ebooks_from_5268_author/
22:44 <second> https://alexandria-library.space/Ebooks/Author/
22:44 <second> https://alexandria-library.space/Ebooks/ComputerScience/
22:44 <second> https://alexandria-library.space/Images/ww2/north-american-aviation-world-war-2/
22:44 <second> https://alexandria-library.space/Images/
22:45 <JAA> Not yet, as far as I know, but arkiver just added them to ArchiveBot.
22:46 <arkiver> yeah
22:47 *** BartoCH has quit IRC (Quit: WeeChat 1.9)
22:50 <second> Did you do it because I said something or was it already added? I'm wondering if you guys watch that and other subreddits
22:51 <second> Is there an archive of Sci-Hub?
22:53 <JAA> I watch some subreddits, but not opendirectories (yet).
22:53 <arkiver> added because you said it
22:53 <arkiver> it looks like something we want to archive
22:54 <JAA> We were discussing libgen several times in the past few days. See the logs: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-17,Sun
22:55 <JAA> Basically, at this point, I assume that IA has a darked copy of it, and even if they don't, the dataset won't disappear anytime soon and can still be archived *if* libgen actually gets in trouble.
22:59 <second> Isn't libgen always possibly in trouble?
22:59 <second> Different governments / institutions trying to shut it down
22:59 <second> JAA: are you Jason Scott?
22:59 <JAA> Possible, but I wouldn't be worried about the data until libgen actually goes offline or similar.
23:00 <JAA> The data is available in (active) torrents and on Usenet...
23:00 <JAA> No, that's SketchCow.
23:01 <second> How does one set up a Usenet account / get one, is there a guide somewhere?
23:01 <JAA> First rule of Usenet...
23:02 <second> Dammit
23:02 <JAA> :-P
23:02 <JAA> Check out /r/usenet. They have a ton of good information.
23:03 <second> Will you guys archive porn?
23:03 <JAA> Well, we did archive Eroshare, so there's that.
23:04 *** Soni has quit IRC (Read error: Connection reset by peer)
23:04 <JAA> There's also that 2 PB webcam archive by /u/Beaston02.
23:04 <second> Eh, I found a wiki which lists actors in porn, but you need to log in
23:04 <JAA> That's not on IA though.
23:05 <second> Can you archive it?
23:05 <second> Why not?
23:05 <second> All this stuff on the IA, and the most viewed stuff in the art museum is vintage porn
23:05 <second> http://95.31.3.127/pbc/Main_Page
23:06 <JAA> Well, I don't think IA is interested in spending 3-4 million dollars over the next few years for random porn webcams.
23:09 <JAA> (That number is based on https://twitter.com/textfiles/status/885527796583284741 )
23:11 <second> How do people archive 2 PB of data?!
23:11 <JAA> I'm not saying it shouldn't be archived. In general, my opinion is that everything should be kept. Unfortunately though, that's not very realistic, and I think there are more important things to preserve than random porn webcams.
23:12 <JAA> Amazon Cloud Drive and now Google Drive.
23:12 <second> Wait a minute, Jason Scott is the same guy behind textfiles.com, interesting
23:12 <JAA> Some people suspect that ACD only killed the unlimited offer because of Beaston02 storing those webcam recordings there.
23:13 <second> JAA: are there any upcoming storage breakthroughs that you can think of?
23:13 <second> Lol, "this is why we can't have nice things"
23:14 *** ld1 has quit IRC (Read error: Connection reset by peer)
23:17 <JAA> No idea really. HAMR will come, but that probably won't really reduce storage costs massively, i.e. not a real breakthrough. DNA storage is still far away, I guess. Otherwise, I don't really know too much about other technologies currently in development.
23:20 *** ld1 has joined #archiveteam-bs
23:32 *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
23:35 *** etudier has joined #archiveteam-bs
23:38 <jrwr> I think DNA might be a good ROM
23:38 <jrwr> not WMRM
23:41 <jrwr> or like old school tape drives
23:49 <JAA> Yeah, it sounds pretty perfect for long-term archival.