#archiveteam-bs 2017-09-17,Sun


Time Nickname Message
00:03 🔗 qwebirc57 has joined #archiveteam-bs
00:04 🔗 qwebirc57 unstable fucking piece of shit
00:04 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
00:04 🔗 qwebirc57 is now known as dd0a13f37
00:04 🔗 Honno has quit IRC (Read error: Operation timed out)
00:07 🔗 dd0a13f37 Did I miss anything?
00:14 🔗 dd0a13f3T has joined #archiveteam-bs
00:15 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
00:18 🔗 JAA Nah
00:21 🔗 refeed has joined #archiveteam-bs
00:22 🔗 dd0a13f3T is now known as dd0a13f37
00:43 🔗 drumstick has quit IRC (Read error: Operation timed out)
00:44 🔗 dd0a13f37 If something is on usenet, is it considered archived? And would it be a good idea to upload library genesis torrents to archive.org, or would that be considered wasting space/bandwidth for piracy?
00:53 🔗 JAA I've heard that there might be a copy of libgen at IA already (but not publicly available). Not sure if it's true though.
00:54 🔗 JAA And although Usenet is safe-ish, I wouldn't consider it archived. Stuff still disappears from it sooner or later.
00:54 🔗 dd0a13f37 You can upload a torrent to IA and have them download it, right?
00:54 🔗 JAA Yes, I believe so.
00:54 🔗 dd0a13f37 Then you could download their zip file of torrents, upload them to archive.org, then wait for them to pull it
00:54 🔗 dd0a13f37 But is it worth it? It's 30tb of data, and it will likely be hidden
00:56 🔗 dd0a13f37 The databases are archived
00:56 🔗 dd0a13f37 https://archive.org/details/libgen-meta-20150824
00:57 🔗 dd0a13f3 has joined #archiveteam-bs
00:57 🔗 BlueMaxim has joined #archiveteam-bs
01:01 🔗 JAA I wouldn't be surprised if either https://archive.org/details/librarygenesis or https://archive.org/details/gen-lib contained a full (hidden) archive.
01:01 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
01:04 🔗 dd0a13f3 Should I avoid uploading it, or will it recognize and deduplicate?
01:05 🔗 dd0a13f3 is now known as dd0a13f37
01:08 🔗 dd0a13f37 both of these are 3 years old, so they're outdated at any rate
01:08 🔗 godane so i'm going through my web archives that i have not uploaded
01:09 🔗 godane or at least thought i uploaded and turned out i didn't
01:10 🔗 dd0a13f37 Okay, so if I have a url pointing to a zip file of torrents, can I just give them the URL?
01:12 🔗 dd0a13f37 No, apparently not. How does this "derive" stuff work, can I have them unpack a zip file for me?
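[Editor's note: the upload being discussed can be done against archive.org's S3-like (IAS3) API: a single HTTP PUT to s3.us.archive.org with the item identifier and target filename in the path, and item metadata in x-archive-meta-* headers. A minimal sketch follows; the item identifier, filename, and keys are hypothetical placeholders, and real IAS3 keys come from archive.org/account/s3.php.]

```python
# Hedged sketch of an IAS3 upload request. 'libgen-torrents-example',
# 'torrents.zip', 'KEY', and 'SECRET' are placeholders, not real values.
import urllib.request

def make_ia_upload_request(identifier, filename, access_key, secret_key):
    # One PUT per file: /identifier/filename on the IAS3 endpoint.
    url = f'https://s3.us.archive.org/{identifier}/{filename}'
    req = urllib.request.Request(url, method='PUT')
    req.add_header('authorization', f'LOW {access_key}:{secret_key}')
    # Item metadata travels in x-archive-meta-* headers on first upload.
    req.add_header('x-archive-meta-mediatype', 'data')
    req.add_header('x-archive-auto-make-bucket', '1')  # create the item if missing
    return req

req = make_ia_upload_request('libgen-torrents-example', 'torrents.zip', 'KEY', 'SECRET')
# urllib.request.urlopen(req, data=...) would actually send the file body.
```

[Unpacking an uploaded zip is a separate question: that is handled server-side by IA's derive process, not by anything in the request itself.]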
01:17 🔗 JAA dd0a13f37: That's when the collection was created, not when any items in the collection were added/last updated.
01:17 🔗 JAA By the way, the graph for the number of items in the second collection of the two looks interesting...
01:17 🔗 dd0a13f37 Sure, but who would update such a collection?
01:18 🔗 JAA Someone from IA?
01:18 🔗 dd0a13f37 2k items is much too small, they have 2m books. Or is it the amount of folders?
01:20 🔗 JAA An item can hold an arbitrary number of directories and files (more or less, there seem to be some issues if the items get very large).
01:21 🔗 JAA If they have a copy, they certainly wouldn't throw it all into one item, and they also certainly wouldn't throw each book/article into its own item.
01:21 🔗 dd0a13f37 The torrents are folders named XXXX000, where XXXX is the unique identifier (from 0-2092)
01:21 🔗 JAA Well, then 2k sounds about right?
01:21 🔗 dd0a13f37 So that could mean there are 2k different folders
01:21 🔗 dd0a13f37 Yeah
01:22 🔗 dd0a13f37 Although, looking at the graph it seems more like 1.4k, or is it log?
01:24 🔗 * JAA shrugs
01:25 🔗 JAA Looks like it might be rounded, so the top of the graph is 1.5k.
01:25 🔗 godane i'm reuploading my images.g4tv.com dumps
01:26 🔗 dd0a13f37 Should I upload them again then?
01:26 🔗 dd0a13f37 They're also missing sci-mag, which is around 50tb
01:26 🔗 JAA Definitely ask IA about this first.
01:27 🔗 JAA But I doubt that that dataset is going to disappear anytime soon.
01:27 🔗 JAA There are certainly several copies stored in various places.
01:27 🔗 JAA (Including the ones publicly available via Usenet or torrents.)
01:28 🔗 dd0a13f37 Yes, that's true. The torrents are seeded, and various mirrors have more or less complete copies.
01:30 🔗 godane looks like i uploaded them, nevermind
01:32 🔗 dd0a13f37 Sci-mag is worse off, but on the other hand they have sci-hub which has multiple servers run by people who are not subject to any jurisdiction
01:32 🔗 dd0a13f37 So both collections should be fine
01:51 🔗 drumstick has joined #archiveteam-bs
02:49 🔗 VADemon_ has quit IRC (left4dead)
02:57 🔗 hook54321 Should I check if a piece of software is already on archive.org before going through all my CDs?
03:06 🔗 dd0a13f37 To upload or to download?
03:07 🔗 dd0a13f37 If they're somehow part of a collection then it might not be such a huge deal
03:21 🔗 hook54321 What do you mean?
03:24 🔗 dd0a13f37 If you have some collection of software on 10 different disks that you bought as a bundle then it might have historical value as a whole even if all the software exists separately
03:49 🔗 hook54321 it's mostly single disks, bought separately.
03:57 🔗 dd0a13f37 Well, it can't be that much storage wasted even if you do upload it twice
03:57 🔗 dd0a13f37 could be different versions as well
03:58 🔗 hook54321 If it has a different cover then I would definitely upload it
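[Editor's note: the "check before uploading" step above can be semi-automated with archive.org's advancedsearch JSON endpoint. A hedged sketch; the title queried is just an example, and you would fetch the URL and inspect `response['response']['numFound']` in the returned JSON.]

```python
# Build a query URL for archive.org's advancedsearch JSON API to see
# whether a title already exists before ripping and uploading a disc.
import urllib.parse

def ia_search_url(title, rows=10):
    params = urllib.parse.urlencode({
        'q': f'title:("{title}")',   # search the title field
        'fl[]': 'identifier',        # only return item identifiers
        'rows': str(rows),
        'output': 'json',
    })
    return 'https://archive.org/advancedsearch.php?' + params

url = ia_search_url('SimCity 2000')  # example title, not from the log
```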
04:02 🔗 drumstick has quit IRC (Read error: Operation timed out)
04:04 🔗 drumstick has joined #archiveteam-bs
04:28 🔗 hook54321 arkiver: I left the channel
04:46 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:52 🔗 Sk1d has joined #archiveteam-bs
04:59 🔗 refeed has quit IRC (Ping timeout: 600 seconds)
05:33 🔗 pizzaiolo has quit IRC (Quit: pizzaiolo)
05:33 🔗 refeed has joined #archiveteam-bs
06:05 🔗 icedice has quit IRC (Quit: Leaving)
06:07 🔗 Dimtree has quit IRC (Read error: Operation timed out)
06:57 🔗 hook54321 Did we grab all the duckduckgo stuff?
07:01 🔗 Dimtree has joined #archiveteam-bs
07:21 🔗 Soni has quit IRC (Ping timeout: 272 seconds)
07:28 🔗 Stilett0 has joined #archiveteam-bs
07:30 🔗 DFJustin has quit IRC (Remote host closed the connection)
07:34 🔗 DFJustin has joined #archiveteam-bs
07:34 🔗 swebb sets mode: +o DFJustin
08:17 🔗 Asparagir has quit IRC (Asparagir)
08:25 🔗 kristian_ has joined #archiveteam-bs
08:37 🔗 Honno has joined #archiveteam-bs
08:52 🔗 kristian_ has quit IRC (Quit: Leaving)
09:24 🔗 schbirid has joined #archiveteam-bs
09:27 🔗 refeed has quit IRC (Read error: Operation timed out)
09:35 🔗 tuluu has quit IRC (Read error: Operation timed out)
09:52 🔗 underscor has joined #archiveteam-bs
09:52 🔗 swebb sets mode: +o underscor
10:02 🔗 tuluu has joined #archiveteam-bs
10:15 🔗 BartoCH has joined #archiveteam-bs
10:29 🔗 zhongfu_ has quit IRC (Ping timeout: 260 seconds)
10:29 🔗 zhongfu has joined #archiveteam-bs
10:44 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
10:44 🔗 Mateon1 has joined #archiveteam-bs
11:00 🔗 noirscape has joined #archiveteam-bs
11:09 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:13 🔗 drumstick has quit IRC (Read error: Operation timed out)
11:19 🔗 joepie91_ hook54321: definitely upload it; if it turns out to be a duplicate it can always be removed later
11:19 🔗 joepie91_ hook54321: there are often many different editions of the same thing
11:26 🔗 Soni has joined #archiveteam-bs
11:36 🔗 pizzaiolo has joined #archiveteam-bs
11:48 🔗 tuluu_ has joined #archiveteam-bs
11:49 🔗 tuluu has quit IRC (Read error: Operation timed out)
12:14 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
12:33 🔗 JAA http://www.instructables.com/id/How-to-fix-a-Samsung-external-m3-hard-drive-in-und/ :-)
13:19 🔗 wp494 has quit IRC (Read error: Connection reset by peer)
13:20 🔗 wp494 has joined #archiveteam-bs
14:11 🔗 schbirid has quit IRC (Quit: Leaving)
14:17 🔗 etudier has joined #archiveteam-bs
14:21 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
15:19 🔗 etudier has quit IRC (Remote host closed the connection)
15:26 🔗 second They say archive.org did a faulty job of archiving something, but they have the new forums up. Can you guys archive their backup? http://gamehacking.org/ (scroll down to the news for Aug 10th)
15:26 🔗 second Or I can archive it but where do I upload it to get it into the archive and what is the proper way to do so?
15:31 🔗 JAA second: Is GameHacking itself also in danger, or is this just about the WiiRd forum archive?
15:32 🔗 JAA Whatever. GH isn't that big anyway. I'll throw it into ArchiveBot.
15:35 🔗 JAA Scratch the "not that big", but it's worth archiving the entire thing. Looks like it has tons of useful resources.
15:39 🔗 mls has quit IRC (Read error: Connection reset by peer)
15:40 🔗 mls has joined #archiveteam-bs
15:56 🔗 second JAA: just the WiiRd forum
15:56 🔗 second JAA: you're going to have a hard time archiving the gamehacking parts though
15:57 🔗 second Lots of javascript on the page, I was doing it but chrome headless crashed with the setup I was using in docker w/ warcproxy
15:57 🔗 second I'll redo it when I get some time and hopefully when firefox headless comes out
15:57 🔗 second I have a jupyter notebook with the code for doing it
15:58 🔗 second going through each page of the manuals and clicking expand
15:58 🔗 second If you can archive the other stuff / whatever you can that would be great because I'm only going for the cheat codes
15:58 🔗 second Very useful for emulators / games old and new
15:59 🔗 second There are some games which are pretty much unplayable without cheat codes because they required certain hardware things
15:59 🔗 second Think pokemon trading to evolve or Django the Solar Boy requiring the literal sun
15:59 🔗 JAA Hm, I haven't found anything that didn't work for me without JavaScript yet.
16:00 🔗 JAA Do you have an example?
16:02 🔗 second http://gamehacking.org/game/4366
16:02 🔗 second Click the down arrows on the side
16:02 🔗 JAA Ah yeah, just saw that now.
16:02 🔗 second They require javascript and outputs the codes for each cheat device
16:03 🔗 second Even includes notes
16:03 🔗 second It's too bad archivebot can't accept javascript to run on each page, or something like selenium commands, but archivebot doesn't even work like that from what I gather
16:03 🔗 second It's more like a distributed wget
16:04 🔗 second perhaps one day it can be upgraded to a very light and small browser, or even a proxy that an archiving browser uses to hit pages
16:04 🔗 second Still a partial archive is better than no archive
16:04 🔗 second JAA: is there an archive of allrecipes?
16:05 🔗 second And are you adding gamehacking.org to the archive?
16:05 🔗 JAA ArchiveBot does have PhantomJS, but that doesn't work too well and wouldn't help in this case at all.
16:06 🔗 JAA Or to be precise, wpull supports PhantomJS, and ArchiveBot uses wpull internally.
16:06 🔗 second wpull hasn't been updated in the longest time!
16:06 🔗 second And isn't taking pull requests either
16:06 🔗 JAA But that's just for scrolling and loading scripted stuff. It doesn't work for clicking on things etc.
16:06 🔗 JAA Yes, I know. chfoo's been pretty busy, from what I gathered.
16:06 🔗 second Is there a more updated version and does it work with youtube-dl now / still?
16:07 🔗 second hmm they are actually in here
16:07 🔗 JAA I know that youtube-dl is broken on at least most pipelines.
16:07 🔗 second They could try giving permissions for others to merge code in or push to the project
16:07 🔗 JAA No idea if it works when used directly with wpull.
16:08 🔗 JAA There's the fork by FalconK, which has a few bug fixes, but other than that I'm not aware of anyone working on it.
16:09 🔗 JAA I've been working on URL prioritisation for a while now, but I haven't spent much time on it really.
16:09 🔗 JAA FalconK's also pretty busy currently, so yeah, nobody's even trying to maintain it.
16:11 🔗 second URL prioritisation?
16:12 🔗 second What is everyone busy with?
16:13 🔗 second Is there a good way to save wikia websites?
16:13 🔗 second So I have a lot of questions, its not often I'm on efnet (maybe I'll fix that) and I've been interested in archiving for a long time
16:14 🔗 JAA https://gist.github.com/JustAnotherArchivist/b82f7848e3c14eaf7717b9bd3ff8321a
16:14 🔗 JAA This is what I wrote a while ago about my plans.
16:14 🔗 JAA It's semi-implemented, but there's still some stuff to do, in particular there is no plugin interface yet, which is necessary to then implement it into ArchiveBot (and grab-site).
16:15 🔗 JAA People are busy with real-life stuff, I guess.
16:16 🔗 JAA Wikia's just MediaWiki, isn't it? There are two ways to save that, either through WikiTeam (no idea how active that is) or through ArchiveBot.
16:16 🔗 second Can the archivebot archive a flakey site which requires login?
16:17 🔗 JAA And regarding your earlier questions: there is no record of an archive of allrecipes in ArchiveBot; someone shared a dump in here a few months ago, but that's not a proper archive and can't be included in the Wayback Machine.
16:18 🔗 JAA Yes, I added gamehacking.org to ArchiveBot.
16:18 🔗 second Yeah, I found that one
16:18 🔗 JAA No, login isn't supported by ArchiveBot.
16:18 🔗 JAA Neither is CloudFlare DDoS protection and stuff like that, by the way.
16:18 🔗 second dang, did not know about cloudflare
16:19 🔗 second Why not cloudflare?
16:19 🔗 second That is a lot of sites we can't archive then
16:19 🔗 JAA Just the DDoS protection bit, i.e. the "Checking your browser" message thingy.
16:19 🔗 JAA That requires you to solve a JS challenge...
16:20 🔗 JAA There was some discussion on this in here a few days ago.
16:25 🔗 second https://github.com/ArchiveTeam/ArchiveBot/issues/216
16:27 🔗 JAA Yes, but cloudflare-scrape is a really shitty and insecure solution.
16:28 🔗 JAA second: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-14,Thu&sel=124-150#l120
16:28 🔗 brayden has quit IRC (Read error: Connection reset by peer)
16:29 🔗 brayden has joined #archiveteam-bs
16:29 🔗 swebb sets mode: +o brayden
16:36 🔗 cf has quit IRC (Ping timeout: 260 seconds)
16:40 🔗 cf has joined #archiveteam-bs
16:51 🔗 etudier has joined #archiveteam-bs
17:24 🔗 Stilett0- has joined #archiveteam-bs
17:27 🔗 Stilett0- is now known as Stiletto
17:41 🔗 chfoo i haven't been feeling like maintaining wpull unfortunately :/ it became a big ball of code
17:44 🔗 kristian_ has joined #archiveteam-bs
17:46 🔗 dd0a13f37 has joined #archiveteam-bs
17:46 🔗 dd0a13f37 JAA: cloudflare whitelists tor using some strange voodoo magic (it's not just the user agent and it works without JS), can we utilize this somehow?
17:48 🔗 dd0a13f37 Or, well, it depends on the protection level, but for ~90% of sites you can browse via Tor. It didn't use to be this way, and if you do "copy as curl" from dev tools and paste into a terminal w/ torsocks you still get the warning page
17:53 🔗 JAA dd0a13f37: Interesting. If we knew more about it, we could perhaps use it, yes. I wonder how reliable it is though.
17:53 🔗 dd0a13f37 It could be details in how SSL is handled
17:53 🔗 dd0a13f37 That seems like the only difference I can think of
17:53 🔗 JAA That would be painful to replicate.
17:54 🔗 balrog has quit IRC (Ping timeout: 1208 seconds)
17:54 🔗 JAA I guess implementing joepie91_'s code in a wpull plugin is probably easier.
17:54 🔗 dd0a13f37 Even if you do "new circuit for this site" and issue the request with a cookie that shouldn't be valid for that IP it still works
17:54 🔗 JAA How do you get that cookie initially?
17:54 🔗 dd0a13f37 Can't you just add a hook to get a valid cookie without changing any structure?
17:54 🔗 dd0a13f37 The site sets it
17:55 🔗 JAA Hm
17:55 🔗 dd0a13f37 You get a __cfduid cookie
17:55 🔗 dd0a13f37 when connecting to a cf site
17:55 🔗 JAA So the normal procedure, right.
17:55 🔗 dd0a13f37 Are those tied to IPs?
17:55 🔗 JAA Yeah, you could implement it as a hook, but the problem is that there is no proper implementation of a bypass.
17:56 🔗 dd0a13f37 Because if I copy the exact request and issue it with curl (same cookies, headers, ua) using torsocks it doesn't work
17:56 🔗 dd0a13f37 That's the spooky thing
17:56 🔗 dd0a13f37 What do you want to bypass? "one more step" or "please turn on js"?
17:56 🔗 JAA "Checking your browser"
17:57 🔗 dd0a13f37 Isn't there?
17:57 🔗 JAA Which is "please turn on JavaScript" if you have JS disabled.
17:57 🔗 JAA Not as far as I know.
17:57 🔗 dd0a13f37 So what does joepie91's code do?
17:57 🔗 balrog has joined #archiveteam-bs
17:57 🔗 swebb sets mode: +o balrog
17:58 🔗 JAA It parses the challenge and calculates the correct response without executing JavaScript.
17:58 🔗 dd0a13f37 Isn't that a bypass?
17:58 🔗 dd0a13f37 Or what exactly are you looking to do?
17:58 🔗 JAA Yes, it is.
17:59 🔗 JAA But it's written in JavaScript, not in Python.
17:59 🔗 JAA https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
17:59 🔗 dd0a13f37 Modify it so it prints the cookie to stdout, then just do shell exec
17:59 🔗 dd0a13f37 easy solution
18:00 🔗 JAA Yeah, we'd like a pure-Python version so we can avoid installing NodeJS or equivalent.
18:00 🔗 JAA I mean, it might work on ArchiveBot where we have PhantomJS anyway, but it'd also be nice to have it in the warrior, for example.
18:00 🔗 dd0a13f37 Can't you set it up as a web service? Send challenge page-get response
18:00 🔗 dd0a13f37 You only need to do it once
18:00 🔗 JAA Huh, that's a nice idea actually.
18:01 🔗 JAA A CF protection cracker API :-)
18:01 🔗 dd0a13f37 """protection"""
18:01 🔗 dd0a13f37 """cracker"""
18:01 🔗 JAA Hehe
18:01 🔗 dd0a13f37 And what about https://github.com/Anorov/cloudflare-scrape ?
18:02 🔗 JAA That executes CF's code in NodeJS and is inherently insecure.
18:02 🔗 dd0a13f37 So it needs node?
18:02 🔗 JAA You can easily trick it into executing arbitrary code, i.e. use it for RCE.
18:02 🔗 JAA Yep
18:02 🔗 dd0a13f37 Oh ok
18:04 🔗 dd0a13f37 So how does the script work, does it take an entire page and return a cookie?
18:07 🔗 JAA Which script?
18:07 🔗 dd0a13f37 https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
18:09 🔗 JAA I'm not sure. I've never used it, and I'm not familiar with using JavaScript like that (i.e. outside of a browser) at all.
18:10 🔗 dd0a13f37 Me neither
18:10 🔗 dd0a13f37 What is executed first? Or is it like a library, so you should look at the exports?
18:10 🔗 JAA As far as I can tell, the function in index.js takes the challenge site as an HTML string as the argument and throws out the relevant parts of the JS challenge that you need to combine somehow to get the response.
18:11 🔗 JAA The challenge looks like this, in case you're not familiar with it:
18:11 🔗 JAA fVbMmUH={"twaBkDiNOR":+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))};
18:11 🔗 JAA fVbMmUH.twaBkDiNOR-=+((+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));fVbMmUH.twaBkDiNOR*=+((!+[]+!![]+!![]+[])+...
18:12 🔗 JAA So you need to transform each of those JSFuck-like expressions into a number and then -=, *=, etc. those numbers to get the correct response.
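[Editor's note: the arithmetic JAA describes can be reproduced without a JS engine. Inside each parenthesised group, `!+[]` and every `!![]` are each worth 1 (NOT of 0, and double-NOT of an array), a bare `+[]` is 0, and a trailing `+[]` coerces the digit to a string so the outer groups concatenate before the unary `+` makes a number. A hedged Python sketch of that idea (not joepie91's actual code, and it handles one expression, not the chained `-=`/`*=` statements):]

```python
import re

def jsfuck_number(expr):
    # Each innermost group like (!+[]+!![]+!![]) encodes one digit:
    # counting the '!+[]' and '!![]' tokens gives its value.
    # The outer structure concatenates the digit strings.
    digits = [str(g.count('!![]') + g.count('!+[]'))
              for g in re.findall(r'\(([^()]+)\)', expr)]
    return int(''.join(digits))

# The first expression from the challenge quoted above:
challenge = '+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))'
# jsfuck_number(challenge) -> 28 (digits 2 and 8 concatenated)
```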
18:12 🔗 dd0a13f37 Can't you just use a regex to sanitize it and then execute them unsafely?
18:12 🔗 JAA Hahaha, good luck sanitising JSFuck.
18:13 🔗 JAA I think cloudflare-scrape tries, but yeah...
18:13 🔗 dd0a13f37 Oh, it can execute code, not just return a value?
18:13 🔗 dd0a13f37 well then you're fucked
18:14 🔗 JAA Yeah. The code would be huge, but you can write *any* JS script with just the six characters ()[]+! used in the challenge.
18:15 🔗 JAA https://en.wikipedia.org/wiki/JSFuck
18:15 🔗 dd0a13f37 Was that an actual example or just randomly generated?
18:16 🔗 JAA That's an actual example.
18:16 🔗 dd0a13f37 Where can I find one?
18:16 🔗 dd0a13f37 A complete one
18:18 🔗 JAA https://gist.github.com/anonymous/85c9b2b57726135a2500a8425b370095
18:23 🔗 dd0a13f37 I don't understand the purpose
18:24 🔗 dd0a13f37 Anyone who wants to do evil stuff would just use one of those scripts, and they're using a botnet so they wouldn't care about cloudflare infecting them
18:24 🔗 dd0a13f37 What's the point?
18:24 🔗 JAA Idk either
18:26 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
18:28 🔗 dd0a13f37 I don't get it, why can't you just use proxies for the really unfriendly sites?
18:28 🔗 Asparagir has joined #archiveteam-bs
18:29 🔗 JAA And by the way, it's not just about CloudFlare serving evil code. Anyone could easily trigger cloudflare-scrape from their own server with an appropriate response.
18:29 🔗 svchfoo3 sets mode: +o Asparagir
18:29 🔗 svchfoo1 sets mode: +o Asparagir
18:29 🔗 dd0a13f37 Well, I doubt you care about ACE when running a botnet
18:30 🔗 JAA Specifically: https://github.com/Anorov/cloudflare-scrape/blob/ee17a7a145990d6975de0be8d8bf5b0abbd87162/cfscrape/__init__.py#L41-L47
18:30 🔗 JAA Yeah, I just mean in general.
18:31 🔗 dd0a13f37 There are commercial proxy providers with clean IPs, the cost of renting a bunch would probably be cheaper than what you spend on hard drives
18:34 🔗 dd0a13f37 Got another response from itorrents, he said he would upload database to archive.org and send link, the other three still haven't responded
18:42 🔗 dd0a13f37 JAA: Looking at generated jsfuck code, it's usually very long
18:43 🔗 dd0a13f37 CF is quite short
18:43 🔗 dd0a13f37 so you should be able to use a regex and limit the length
18:44 🔗 dd0a13f37 for example encoding the character a is 846 chars encoded
18:45 🔗 dd0a13f37 http://www.jsfuck.com/
18:47 🔗 dd0a13f37 And CF's brackets are always empty - [], jsfuck needs to have something inside to eval
18:48 🔗 JAA Yeah, I'm aware of that. It's still sloppy though.
18:49 🔗 dd0a13f37 It should be safe though
18:49 🔗 JAA I don't think you strictly need something inside the brackets to do things in JSFuck, but it probably helps shorten the obfuscated code.
18:50 🔗 dd0a13f37 You can never get the eval() you need to do bad things
18:50 🔗 dd0a13f37 It shouldn't be turing complete
18:53 🔗 JAA Possible
18:53 🔗 JAA I don't really know enough about JSFuck to say for sure.
18:57 🔗 arkhive has joined #archiveteam-bs
18:57 🔗 dd0a13f37 https://esolangs.org/wiki/JSFuck
18:58 🔗 dd0a13f37 it needs a big blob which is not possible to encode in under a certain amount of characters, it's ugly as fuck but it should be safe
18:59 🔗 dd0a13f37 the eval blob is 831 characters, so if you set an upper limit at 200 you should be fine
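[Editor's note: the heuristic dd0a13f37 sketches above (short expressions, only empty brackets, hard length cap) could look like the following in Python. The 200-character cap and the character whitelist are the assumptions from this discussion applied to an extracted bracket expression, not a vetted security boundary.]

```python
import re

# Only the characters CF's arithmetic expressions actually use.
ALLOWED = re.compile(r'^[+\-*/()!\[\]=;.]+$')

def looks_like_cf_arithmetic(expr, max_len=200):
    # Reject anything long enough to hide a JSFuck eval payload (the eval
    # blob alone is ~831 chars), anything outside the whitelist, and any
    # non-empty bracket pair, which JSFuck needs to build strings.
    if len(expr) > max_len or not ALLOWED.match(expr):
        return False
    return re.search(r'\[[^\]]', expr) is None  # all brackets must be empty: []
```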
19:02 🔗 etudier has joined #archiveteam-bs
19:06 🔗 etudier has quit IRC (Client Quit)
19:06 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
19:07 🔗 mundus What's the best tool for large site archival?
19:07 🔗 arkhive has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
19:22 🔗 JAA mundus: Define "large"?
19:23 🔗 mundus like a million pages
19:23 🔗 JAA wpull can handle that easily, assuming you have sufficient disk space.
19:23 🔗 mundus Okay
19:23 🔗 mundus I was guessing wpull
19:24 🔗 JAA Not sure if it's the "best" tool, but it works well.
19:24 🔗 JAA I've run multi-million URL archivals with wpull several times.
19:24 🔗 mundus alright, what options do you normally use?
19:25 🔗 JAA I think I mostly copied those used in ArchiveBot, then adapted them a bit in some cases.
19:26 🔗 JAA https://github.com/ArchiveTeam/ArchiveBot/blob/a6e6da8ba37e733e4b10b7090b5fc4a6cffc9119/pipeline/archivebot/seesaw/wpull.py#L18-L53
19:26 🔗 mundus cool, thanks
19:35 🔗 joepie91_ mundus: you may find grab-site useful also
19:35 🔗 joepie91_ sort of like a local archivebot
19:35 🔗 joepie91_ mundus: ref https://github.com/ludios/grab-site
19:36 🔗 mundus oh nice
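[Editor's note: a rough sketch of the kind of invocation this exchange amounts to, loosely mirroring the ArchiveBot option list JAA linked. The URL and WARC name are placeholders, and this is not ArchiveBot's exact flag set.]

```python
def wpull_args(url, warc_name):
    # Assemble a wpull command line for a large recursive WARC crawl.
    return [
        'wpull', url,
        '--recursive',                    # follow links
        '--page-requisites',              # also fetch CSS/JS/images per page
        '--warc-file', warc_name,         # write the output as a WARC
        '--database', warc_name + '.db',  # on-disk URL table; essential at millions of URLs
        '--no-robots',                    # archive pages excluded by robots.txt too
        '--delete-after',                 # keep only the WARC, not a mirror tree
    ]

args = wpull_args('https://example.org/', 'example-crawl')
```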
19:47 🔗 second chfoo: do you have a doc explaining how wpull works with youtube-dl etc or how it should work?
19:55 🔗 second How do I become a member of the ArchiveTeam and what would that mean?
19:58 🔗 second JAA: is there a doc somewhere with how the IA archives things and keeps backups?
19:59 🔗 etudier has joined #archiveteam-bs
20:02 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
20:04 🔗 JAA second: You become a member by doing stuff that aligns with AT's activities. There isn't anything formal.
20:06 🔗 JAA There is some stuff in the "help" section of archive.org, and also some blog entries. Not sure what else exists.
20:07 🔗 JAA I don't think the individual archival strategies etc. are documented well (publicly) though.
20:12 🔗 BartoCH has joined #archiveteam-bs
20:21 🔗 jrwr second: anyone can do /something/, we are more of a method than anything, what do you want to do?
20:26 🔗 kristian_ has quit IRC (Remote host closed the connection)
20:26 🔗 second not sure, I'm more working on file categorization / curation right now
20:27 🔗 second What kind of things shouldn't we archive?
20:28 🔗 jrwr Well
20:28 🔗 jrwr Thats a hard question
20:29 🔗 jrwr If you are doing web archival, I would make sure to save everything as WARCs
20:29 🔗 jrwr (wget supports this, so does wpull)
20:30 🔗 jrwr Anything else, just do best quality you can. the more metadata the better
20:30 🔗 jrwr make an account on IA and go to town uploading things
20:31 🔗 jrwr check out SketchCow's IA and see how he uploads things
20:31 🔗 jrwr (for things like CDs, Tapes, Paper)
20:58 🔗 DFJustin has quit IRC (Remote host closed the connection)
21:08 🔗 DFJustin has joined #archiveteam-bs
21:08 🔗 swebb sets mode: +o DFJustin
21:25 🔗 ZexaronS has quit IRC (Quit: Leaving)
22:20 🔗 drumstick has joined #archiveteam-bs
22:25 🔗 Honno has quit IRC (Read error: Operation timed out)
22:30 🔗 Soni has quit IRC (Ping timeout: 506 seconds)
22:41 🔗 Soni has joined #archiveteam-bs
22:41 🔗 second Does the internet archive have deduplication active?
22:41 🔗 second I wouldn't want to upload a bunch of stuff and waste their space
22:41 🔗 ZexaronS has joined #archiveteam-bs
22:43 🔗 second JAA: has this been archived? https://www.reddit.com/r/opendirectories/comments/6zuk7v/alexandria_library_38029_ebooks_from_5268_author/
22:44 🔗 second https://alexandria-library.space/Ebooks/Author/
22:44 🔗 second https://alexandria-library.space/Ebooks/ComputerScience/
22:44 🔗 second https://alexandria-library.space/Images/ww2/north-american-aviation-world-war-2/
22:44 🔗 second https://alexandria-library.space/Images/
22:45 🔗 JAA Not yet, as far as I know, but arkiver just added them to ArchiveBot.
22:46 🔗 arkiver yeah
22:47 🔗 BartoCH has quit IRC (Quit: WeeChat 1.9)
22:50 🔗 second Did you do it because I said something or was it already added? I'm wondering if you guys watch that and other reddit(s)
22:51 🔗 second Is there an archive of scihub?
22:53 🔗 JAA I watch some subreddits, but not opendirectories (yet).
22:53 🔗 arkiver added because you said it
22:53 🔗 arkiver it looks like something we want to archive
22:54 🔗 JAA We were discussing libgen several times in the past few days. See the logs: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-17,Sun
22:55 🔗 JAA Basically, at this point, I assume that IA has a darked copy of it, and even if they don't, the dataset won't disappear anytime soon and can still be archived *if* libgen actually gets in trouble.
22:59 🔗 second Isn't libgen always possibly in trouble?
22:59 🔗 second Different governments / institutions trying to shut it down
22:59 🔗 second JAA are you Jason Scott?
22:59 🔗 JAA Possible, but I wouldn't be worried about the data until libgen actually goes offline or similar.
23:00 🔗 JAA The data is available in (active) torrents and on Usenet...
23:00 🔗 JAA No, that's SketchCow.
23:01 🔗 second How does one setup a Usenet account / get one, is there a guide somewhere?
23:01 🔗 JAA First rule of Usenet...
23:02 🔗 second Dammit
23:02 🔗 JAA :-P
23:02 🔗 JAA Check out /r/usenet. They have a ton of good information.
23:03 🔗 second Will you guys archive porn?
23:03 🔗 JAA Well, we did archive Eroshare, so there's that.
23:04 🔗 Soni has quit IRC (Read error: Connection reset by peer)
23:04 🔗 JAA There's also that 2 PB webcam archive by /u/Beaston02.
23:04 🔗 second Eh, I found a wiki which lists actors in porn but you need to login
23:04 🔗 JAA That's not on IA though.
23:05 🔗 second Can you archive it?
23:05 🔗 second Why not?
23:05 🔗 second All this stuff on the IA and the most viewed stuff in the art museum is vintage porn
23:05 🔗 second http://95.31.3.127/pbc/Main_Page
23:06 🔗 JAA Well, I don't think IA is interested in spending 3-4 million dollars over the next few years for random porn webcams.
23:09 🔗 JAA (That number is based on https://twitter.com/textfiles/status/885527796583284741 )
23:11 🔗 second How do people archive 2PB of data?!
23:11 🔗 JAA I'm not saying it shouldn't be archived. In general, my opinion is that everything should be kept. Unfortunately though, that's not very realistic, and I think there are more important things to preserve than random porn webcams.
23:12 🔗 JAA Amazon Cloud Drive and now Google Drive.
23:12 🔗 second Wait a minute, Jason Scott is the same guy behind textfiles.com, interesting
23:12 🔗 JAA Some people suspect that ACD only killed the unlimited offer because of Beaston02 storing those webcam recordings there.
23:13 🔗 second JAA: are there any upcoming storage breakthroughs that you can think of?
23:13 🔗 second Lol, "this is why we can't have nice things"
23:14 🔗 ld1 has quit IRC (Read error: Connection reset by peer)
23:17 🔗 JAA No idea really. HAMR will come, but that probably won't really reduce storage costs massively, i.e. not a real breakthrough. DNA storage is still far away, I guess. Otherwise, I don't really know too much about other technologies currently in development.
23:20 🔗 ld1 has joined #archiveteam-bs
23:32 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
23:35 🔗 etudier has joined #archiveteam-bs
23:38 🔗 jrwr I think DNA might be a good ROM
23:38 🔗 jrwr not WMRM (write many, read many)
23:41 🔗 jrwr or like old school tape drives
23:49 🔗 JAA Yeah, it sounds pretty perfect for long-term archival.
