00:03 *** qwebirc57 has joined #archiveteam-bs
00:04 <qwebirc57> unstable fucking piece of shit
00:04 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
00:04 *** qwebirc57 is now known as dd0a13f37
00:04 *** Honno has quit IRC (Read error: Operation timed out)
00:07 <dd0a13f37> I missed anything?
00:14 *** dd0a13f3T has joined #archiveteam-bs
00:15 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
00:18 <JAA> Nah
00:21 *** refeed has joined #archiveteam-bs
00:22 *** dd0a13f3T is now known as dd0a13f37
00:43 *** drumstick has quit IRC (Read error: Operation timed out)
00:44 <dd0a13f37> If something is on usenet, is it considered archived? And would it be a good idea to upload library genesis torrents to archive.org, or would that be considered wasting space/bandwidth for piracy?
00:53 <JAA> I've heard that there might be a copy of libgen at IA already (but not publicly available). Not sure if it's true though.
00:54 <JAA> And although Usenet is safe-ish, I wouldn't consider it archived. Stuff still disappears from it sooner or later.
00:54 <dd0a13f37> You can upload a torrent to IA and have them download it, right?
00:54 <JAA> Yes, I believe so.
00:54 <dd0a13f37> Then you could download their zip file of torrents, upload them to archive.org, then wait for them to pull it
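
[Editor's note] For context on the workflow being proposed here, a minimal sketch using the internetarchive Python library (pip install internetarchive, then "ia configure" for credentials). The item identifier, file name, and metadata are hypothetical examples, and whether IA actually pulls the torrent's content into the item is IA-side behaviour, not something the API guarantees:

    from internetarchive import upload

    # Upload a .torrent file into a (hypothetical) new item; IA can then
    # fetch the torrent's content itself on their end.
    responses = upload(
        'libgen-torrents-example',    # hypothetical item identifier
        files=['r_000.torrent'],      # hypothetical torrent file from the zip
        metadata={'mediatype': 'data', 'title': 'Library Genesis torrents (example)'},
    )
    print(responses[0].status_code)   # 200 on success
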
00:54 <dd0a13f37> But is it worth it? It's 30 TB of data, and it will likely be hidden
00:56 <dd0a13f37> The databases are archived
00:56 <dd0a13f37> https://archive.org/details/libgen-meta-20150824
00:57 *** dd0a13f3 has joined #archiveteam-bs
00:57 *** BlueMaxim has joined #archiveteam-bs
01:01 <JAA> I wouldn't be surprised if either https://archive.org/details/librarygenesis or https://archive.org/details/gen-lib contained a full (hidden) archive.
01:01 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
01:04 <dd0a13f3> Should I avoid uploading it, or will it recognize and deduplicate?
01:05 *** dd0a13f3 is now known as dd0a13f37
01:08 <dd0a13f37> both of these are 3 years old, so they're outdated at any rate
01:08 <godane> so i'm going through my web archives that i have not uploaded
01:09 <godane> or at least thought i uploaded and turned out i didn't
01:10 <dd0a13f37> Okay, so if I have a URL pointing to a zip file of torrents, can I just give them the URL?
01:12 <dd0a13f37> No, apparently not. How does this "derive" stuff work, can I have them unpack a zip file for me?
01:17 <JAA> dd0a13f37: That's when the collection was created, not when any items in the collection were added/last updated.
01:17 <JAA> By the way, the graph for the number of items in the second collection of the two looks interesting...
01:17 <dd0a13f37> Sure, but who would update such a collection?
01:18 <JAA> Someone from IA?
01:18 <dd0a13f37> 2k items is much too small, they have 2 million books. Or is it the number of folders?
01:20 <JAA> An item can hold an arbitrary number of directories and files (more or less, there seem to be some issues if the items get very large).
01:21 <JAA> If they have a copy, they certainly wouldn't throw it all into one item, and they also certainly wouldn't throw each book/article into its own item.
01:21 <dd0a13f37> The torrents are folders named XXXX000, where XXXX is the unique identifier (from 0-2092)
01:21 <JAA> Well, then 2k sounds about right?
01:21 <dd0a13f37> So that could mean there are 2k different folders
01:21 <dd0a13f37> Yeah
01:22 <dd0a13f37> Although, looking at the graph it seems more like 1.4k, or is it log?
01:24 * JAA shrugs
01:25 <JAA> Looks like it might be rounded, so the top of the graph is 1.5k.
01:25 <godane> i'm reuploading my images.g4tv.com dumps
01:26 <dd0a13f37> Should I upload them again then?
01:26 <dd0a13f37> They're also missing sci-mag, which is around 50 TB
01:26 <JAA> Definitely ask IA about this first.
01:27 <JAA> But I doubt that that dataset is going to disappear anytime soon.
01:27 <JAA> There are certainly several copies stored in various places.
01:27 <JAA> (Including the ones publicly available via Usenet or torrents.)
01:28 <dd0a13f37> Yes, that's true. The torrents are seeded, and various mirrors have more or less complete copies.
01:30 <godane> looks like i uploaded them, nevermind
01:32 <dd0a13f37> Sci-mag is worse off, but on the other hand they have sci-hub, which has multiple servers run by people who are not subject to any jurisdiction
01:32 <dd0a13f37> So both collections should be fine
01:51 *** drumstick has joined #archiveteam-bs
02:49 *** VADemon_ has quit IRC (left4dead)
02:57 <hook54321> Should I check if a piece of software is already on archive.org before going through all my CDs?
03:06 <dd0a13f37> To upload or to download?
03:07 <dd0a13f37> If they're somehow part of a collection then it might not be such a huge deal
03:21 <hook54321> What do you mean?
03:24 <dd0a13f37> If you have some collection of software on 10 different disks that you bought as a bundle then it might have historical value as a whole even if all the software exists separately
03:49 <hook54321> it's mostly single disks, bought separately.
03:57 <dd0a13f37> Well, it can't be that much storage wasted even if you do upload it twice
03:57 <dd0a13f37> could be different versions as well
03:58 <hook54321> If it has a different cover then I would definitely upload it
04:02 *** drumstick has quit IRC (Read error: Operation timed out)
04:04 *** drumstick has joined #archiveteam-bs
04:28 <hook54321> arkiver: I left the channel
04:46 *** Sk1d has quit IRC (Ping timeout: 194 seconds)
04:52 *** Sk1d has joined #archiveteam-bs
04:59 *** refeed has quit IRC (Ping timeout: 600 seconds)
05:33 *** pizzaiolo has quit IRC (Quit: pizzaiolo)
05:33 *** refeed has joined #archiveteam-bs
06:05 *** icedice has quit IRC (Quit: Leaving)
06:07 *** Dimtree has quit IRC (Read error: Operation timed out)
06:57 <hook54321> Did we grab all the duckduckgo stuff?
07:01 *** Dimtree has joined #archiveteam-bs
07:21 *** Soni has quit IRC (Ping timeout: 272 seconds)
07:28 *** Stilett0 has joined #archiveteam-bs
07:30 *** DFJustin has quit IRC (Remote host closed the connection)
07:34 *** DFJustin has joined #archiveteam-bs
07:34 *** swebb sets mode: +o DFJustin
08:17 *** Asparagir has quit IRC (Asparagir)
08:25 *** kristian_ has joined #archiveteam-bs
08:37 *** Honno has joined #archiveteam-bs
08:52 *** kristian_ has quit IRC (Quit: Leaving)
09:24 *** schbirid has joined #archiveteam-bs
09:27 *** refeed has quit IRC (Read error: Operation timed out)
09:35 *** tuluu has quit IRC (Read error: Operation timed out)
09:52 *** underscor has joined #archiveteam-bs
09:52 *** swebb sets mode: +o underscor
10:02 *** tuluu has joined #archiveteam-bs
10:15 *** BartoCH has joined #archiveteam-bs
10:29 *** zhongfu_ has quit IRC (Ping timeout: 260 seconds)
10:29 *** zhongfu has joined #archiveteam-bs
10:44 *** Mateon1 has quit IRC (Read error: Operation timed out)
10:44 *** Mateon1 has joined #archiveteam-bs
11:00 *** noirscape has joined #archiveteam-bs
11:09 *** BlueMaxim has quit IRC (Quit: Leaving)
11:13 *** drumstick has quit IRC (Read error: Operation timed out)
11:19 <joepie91_> hook54321: definitely upload it; if it turns out to be a duplicate it can always be removed later
11:19 <joepie91_> hook54321: there are often many different editions of the same thing
11:26 *** Soni has joined #archiveteam-bs
11:36 *** pizzaiolo has joined #archiveteam-bs
11:48 *** tuluu_ has joined #archiveteam-bs
11:49 *** tuluu has quit IRC (Read error: Operation timed out)
12:14 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
12:33 <JAA> http://www.instructables.com/id/How-to-fix-a-Samsung-external-m3-hard-drive-in-und/ :-)
13:19 *** wp494 has quit IRC (Read error: Connection reset by peer)
13:20 *** wp494 has joined #archiveteam-bs
14:11 *** schbirid has quit IRC (Quit: Leaving)
14:17 *** etudier has joined #archiveteam-bs
14:21 *** Stilett0 has quit IRC (Read error: Operation timed out)
15:19 *** etudier has quit IRC (Remote host closed the connection)
15:26 <second> They say archive.org did a faulty job of archiving something, but they have the new forums up, can you guys archive their backup? http://gamehacking.org/ Scroll down to news for Aug 10th
15:26 <second> Or I can archive it, but where do I upload it to get it into the archive, and what is the proper way to do so?
15:31 <JAA> second: Is GameHacking itself also in danger, or is this just about the WiiRd forum archive?
15:32 <JAA> Whatever. GH isn't that big anyway. I'll throw it into ArchiveBot.
15:35 <JAA> Scratch the "not that big", but it's worth archiving the entire thing. Looks like it has tons of useful resources.
15:39 *** mls has quit IRC (Read error: Connection reset by peer)
15:40 *** mls has joined #archiveteam-bs
15:56 <second> JAA: just the WiiRd forum
15:56 <second> JAA: you're going to have a hard time archiving the gamehacking parts though
15:57 <second> Lots of JavaScript on the page. I was doing it, but headless Chrome crashed with the setup I was using in Docker w/ warcproxy
15:57 <second> I'll redo it when I get some time, and hopefully when headless Firefox comes out
15:57 <second> I have a Jupyter notebook with the code for doing it
15:58 <second> going through each page of the manuals and clicking expand
15:58 <second> If you can archive the other stuff / whatever you can, that would be great, because I'm only going for the cheat codes
15:58 <second> Very useful for emulators / games old and new
15:59 <second> There are some games which are pretty much unplayable without cheat codes because they required certain hardware things
15:59 <second> Think Pokemon trading to evolve, or Django the Solar Boy requiring the literal sun
15:59 <JAA> Hm, I haven't found anything that didn't work for me without JavaScript yet.
16:00 <JAA> Do you have an example?
16:02 <second> http://gamehacking.org/game/4366
16:02 <second> Click the down arrows on the side
16:02 <JAA> Ah yeah, just saw that now.
16:02 <second> They require JavaScript and output the codes for each cheat device
16:03 <second> Even includes notes
16:03 <second> It's too bad ArchiveBot can't accept JavaScript to run on each page, or something like Selenium commands, but ArchiveBot doesn't even work like that from what I gather
16:03 <second> It's more like a distributed wget
16:04 <second> perhaps one day it can be upgraded to a very light and small browser, or even a proxy that an archiving browser uses to hit pages
16:04 <second> Still, a partial archive is better than no archive
16:04 <second> JAA: is there an archive of allrecipes?
16:05 <second> And are you adding gamehacking.org to the archive?
16:05 <JAA> ArchiveBot does have PhantomJS, but that doesn't work too well and wouldn't help in this case at all.
16:06 <JAA> Or to be precise, wpull supports PhantomJS, and ArchiveBot uses wpull internally.
16:06 <second> wpull hasn't been updated in the longest!
16:06 <second> And isn't taking pull requests either
16:06 <JAA> But that's just for scrolling and loading scripted stuff. It doesn't work for clicking on things etc.
16:06 <JAA> Yes, I know. chfoo's been pretty busy, from what I gathered.
16:07 <second> Is there a more updated version, and does it work with youtube-dl now / still?
16:07 <second> hmm they are actually in here
16:07 <JAA> I know that youtube-dl is broken on at least most pipelines.
16:07 <second> They could try giving permissions for others to merge code in or push to the project
16:08 <JAA> No idea if it works when used directly with wpull.
16:09 <JAA> There's the fork by FalconK, which has a few bug fixes, but other than that I'm not aware of anyone working on it.
16:09 <JAA> I've been working on URL prioritisation for a while now, but I haven't spent much time on it really.
16:09 <JAA> FalconK's also pretty busy currently, so yeah, nobody's even trying to maintain it.
16:11 <second> URL prioritisation?
16:12 <second> What is everyone busy with?
16:13 <second> Is there a good way to save Wikia websites?
16:13 <second> So I have a lot of questions; it's not often I'm on EFnet (maybe I'll fix that) and I've been interested in archiving for a long time
16:14 <JAA> https://gist.github.com/JustAnotherArchivist/b82f7848e3c14eaf7717b9bd3ff8321a
16:14 <JAA> This is what I wrote a while ago about my plans.
16:14 <JAA> It's semi-implemented, but there's still some stuff to do, in particular there is no plugin interface yet, which is necessary to then implement it into ArchiveBot (and grab-site).
16:15 <JAA> People are busy with real-life stuff, I guess.
16:16 <JAA> Wikia's just MediaWiki, isn't it? There are two ways to save that, either through WikiTeam (no idea how active that is) or through ArchiveBot.
16:16 <second> Can ArchiveBot archive a flaky site which requires login?
16:17 <JAA> And regarding your earlier questions: there is no record of an archive of allrecipes in ArchiveBot; someone shared a dump in here a few months ago, but that's not a proper archive and can't be included in the Wayback Machine.
16:18 <JAA> Yes, I added gamehacking.org to ArchiveBot.
16:18 <second> Yeah, I found that one
16:18 <JAA> No, login isn't supported by ArchiveBot.
16:18 <JAA> Neither is CloudFlare DDoS protection and stuff like that, by the way.
16:19 <second> dang, did not know about CloudFlare
16:19 <second> Why not CloudFlare?
16:19 <second> That is a lot of sites we can't archive then
16:19 <JAA> Just the DDoS protection bit, i.e. the "Checking your browser" message thingy.
16:20 <JAA> That requires you to solve a JS challenge...
16:25 <JAA> There was some discussion on this in here a few days ago.
16:27 <second> https://github.com/ArchiveTeam/ArchiveBot/issues/216
16:28 <JAA> Yes, but cloudflare-scrape is a really shitty and insecure solution.
16:28 <JAA> second: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-14,Thu&sel=124-150#l120
16:29 *** brayden has quit IRC (Read error: Connection reset by peer)
16:29 *** brayden has joined #archiveteam-bs
16:36 *** swebb sets mode: +o brayden
16:40 *** cf has quit IRC (Ping timeout: 260 seconds)
16:51 *** cf has joined #archiveteam-bs
17:24 *** etudier has joined #archiveteam-bs
17:27 *** Stilett0- has joined #archiveteam-bs
17:41 *** Stilett0- is now known as Stiletto
17:44 <chfoo> i haven't been feeling like maintaining wpull unfortunately :/ it became a big ball of code
17:46 *** kristian_ has joined #archiveteam-bs
17:46 *** dd0a13f37 has joined #archiveteam-bs
17:48 <dd0a13f37> JAA: CloudFlare whitelists Tor using some strange voodoo magic (it's not just the user agent, and it works without JS), can we utilize this somehow?
17:53 <dd0a13f37> Or, well, it depends on the protection level, but for 90% of sites you can browse with Tor. It didn't use to be this way, and if you do "copy as cURL" from dev tools and paste into a terminal w/ torsocks you still get the warning page
17:53 <JAA> dd0a13f37: Interesting. If we knew more about it, we could perhaps use it, yes. I wonder how reliable it is though.
17:53 <dd0a13f37> It could be details in how SSL is handled
17:53 <dd0a13f37> That seems like the only difference I can think of
17:54 <JAA> That would be painful to replicate.
17:54 *** balrog has quit IRC (Ping timeout: 1208 seconds)
17:54 <JAA> I guess implementing joepie91_'s code in a wpull plugin is probably easier.
17:54 <dd0a13f37> Even if you do "new circuit for this site" and issue the request with a cookie that shouldn't be valid for that IP, it still works
17:54 <JAA> How do you get that cookie initially?
17:54 <dd0a13f37> Can't you just add a hook to get a valid cookie without changing any structure?
17:55 <dd0a13f37> The site sets it
17:55 <JAA> Hm
17:55 <dd0a13f37> You get a __cfduid cookie
17:55 <dd0a13f37> when connecting to a CF site
17:55 <JAA> So the normal procedure, right.
17:55 <dd0a13f37> Are those tied to IPs?
17:56 <JAA> Yeah, you could implement it as a hook, but the problem is that there is no proper implementation of a bypass.
17:56 <dd0a13f37> Because if I copy the exact request and issue it with curl (same cookies, headers, UA) using torsocks, it doesn't work
17:56 <dd0a13f37> That's the spooky thing
17:56 <dd0a13f37> What do you want to bypass? "One more step" or "please turn on JS"?
17:57 <JAA> "Checking your browser"
17:57 <dd0a13f37> Isn't there?
17:57 <JAA> Which is "please turn on JavaScript" if you have JS disabled.
17:57 <JAA> Not as far as I know.
17:57 <dd0a13f37> So what does joepie91's code do?
17:57 *** balrog has joined #archiveteam-bs
17:58 *** swebb sets mode: +o balrog
17:58 <JAA> It parses the challenge and calculates the correct response without executing JavaScript.
17:58 <dd0a13f37> Isn't that a bypass?
17:58 <dd0a13f37> Or what exactly are you looking to do?
17:59 <JAA> Yes, it is.
17:59 <JAA> But it's written in JavaScript, not in Python.
17:59 <JAA> https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
17:59 <dd0a13f37> Modify it so it prints the cookie to stdout, then just do a shell exec
18:00 <dd0a13f37> easy solution
18:00 <JAA> Yeah, we'd like a pure-Python version so we can avoid installing NodeJS or equivalent.
18:00 <JAA> I mean, it might work on ArchiveBot where we have PhantomJS anyway, but it'd also be nice to have it in the warrior, for example.
18:00 <dd0a13f37> Can't you set it up as a web service? Send challenge page, get response
18:00 <dd0a13f37> You only need to do it once
18:01 <JAA> Huh, that's a nice idea actually.
18:01 <JAA> A CF protection cracker API :-)
18:01 <dd0a13f37> """protection"""
18:01 <dd0a13f37> """cracker"""
18:01 <JAA> Hehe
18:02 <dd0a13f37> And what about https://github.com/Anorov/cloudflare-scrape ?
18:02 <JAA> That executes CF's code in NodeJS and is inherently insecure.
18:02 <dd0a13f37> So it needs Node?
18:02 <JAA> You can easily trick it into executing arbitrary code, i.e. use it for RCE.
18:02 <JAA> Yep
18:04 <dd0a13f37> Oh ok
18:07 <dd0a13f37> So how does the script work, does it take an entire page and return a cookie?
18:07 <JAA> Which script?
18:09 <dd0a13f37> https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
18:10 <JAA> I'm not sure. I've never used it, and I'm not familiar with using JavaScript like that (i.e. outside of a browser) at all.
18:10 <dd0a13f37> Me neither
18:10 <dd0a13f37> What is executed first? Or is it like a library, so you should look at the exports?
18:11 <JAA> As far as I can tell, the function in index.js takes the challenge site as an HTML string as the argument and throws out the relevant parts of the JS challenge that you need to combine somehow to get the response.
18:11 <JAA> The challenge looks like this, in case you're not familiar with it:
18:11 <JAA> fVbMmUH={"twaBkDiNOR":+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))};
18:12 <JAA> fVbMmUH.twaBkDiNOR-=+((+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));fVbMmUH.twaBkDiNOR*=+((!+[]+!![]+!![]+[])+...
18:12 <JAA> So you need to transform each of those JSFuck-like expressions into a number and then -=, *=, etc. those numbers to get the correct response.
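
[Editor's note] To illustrate the decoding step JAA describes, here is a toy Python sketch. It only handles the numeric JSFuck subset shown above (digit groups built from +[], !+[] and !![]); it is not joepie91_'s implementation, and a real solver would still have to chain the -=, *=, etc. operations mentioned above:

    import re

    def digit(group: str) -> int:
        # In JS, +[] is 0 and each truthy term (!+[] or !![]) adds 1, so a
        # parenthesised group evaluates to the count of those terms.
        return len(re.findall(r'!\+\[\]|!!\[\]', group))

    def decode_number(expr: str) -> int:
        # +((A)+(B)+...) turns each innermost group into one digit string,
        # concatenates them, and coerces the result back to a number.
        groups = re.findall(r'\(([^()]+)\)', expr)
        return int(''.join(str(digit(g)) for g in groups))

    # The first expression above decodes to 28:
    # decode_number('+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))') == 28
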
18:12 <dd0a13f37> Can't you just use a regex to sanitize it and then execute them unsafely?
18:12 <JAA> Hahaha, good luck sanitising JSFuck.
18:13 <JAA> I think cloudflare-scrape tries, but yeah...
18:13 <dd0a13f37> Oh, it can execute code, not just return a value?
18:13 <dd0a13f37> well then you're fucked
18:14 <JAA> Yeah. The code would be huge, but you can write *any* JS script with just the six characters ()[]+! used in the challenge.
18:15 <JAA> https://en.wikipedia.org/wiki/JSFuck
18:15 <dd0a13f37> Was that an actual example or just randomly generated?
18:16 <JAA> That's an actual example.
18:16 <dd0a13f37> Where can I find one?
18:16 <dd0a13f37> A complete one
18:18 <JAA> https://gist.github.com/anonymous/85c9b2b57726135a2500a8425b370095
18:23 <dd0a13f37> I don't understand the purpose
18:24 <dd0a13f37> Anyone who wants to do evil stuff would just use one of those scripts, and they're using a botnet so they wouldn't care about CloudFlare infecting them
18:24 <dd0a13f37> What's the point?
18:24 <JAA> Idk either
18:26 *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
18:28 <dd0a13f37> I don't get it, why can't you just use proxies for the really unfriendly sites?
18:28 *** Asparagir has joined #archiveteam-bs
18:29 <JAA> And by the way, it's not just about CloudFlare serving evil code. Anyone could easily trigger cloudflare-scrape from their own server with an appropriate response.
18:29 *** svchfoo3 sets mode: +o Asparagir
18:29 *** svchfoo1 sets mode: +o Asparagir
18:29 <dd0a13f37> Well, I doubt you care about ACE when running a botnet
18:30 <JAA> Specifically: https://github.com/Anorov/cloudflare-scrape/blob/ee17a7a145990d6975de0be8d8bf5b0abbd87162/cfscrape/__init__.py#L41-L47
18:30 <JAA> Yeah, I just mean in general.
18:31 <dd0a13f37> There are commercial proxy providers with clean IPs; the cost of renting a bunch would probably be cheaper than what you spend on hard drives
18:34 <dd0a13f37> Got another response from itorrents, he said he would upload the database to archive.org and send a link, the other three still haven't responded
18:42 <dd0a13f37> JAA: Looking at generated JSFuck code, it's usually very long
18:43 <dd0a13f37> CF's is quite short
18:43 <dd0a13f37> so you should be able to use a regex and limit the length
18:44 <dd0a13f37> for example, encoding the character "a" is 846 chars encoded
18:45 <dd0a13f37> http://www.jsfuck.com/
18:47 <dd0a13f37> And CF's brackets are always empty - [], JSFuck needs to have something inside to eval
18:48 <JAA> Yeah, I'm aware of that. It's still sloppy though.
18:49 <dd0a13f37> It should be safe though
18:49 <JAA> I don't think you strictly need something inside the brackets to do things in JSFuck, but it probably helps shorten the obfuscated code.
18:50 <dd0a13f37> You can never get the eval() you need to do bad things
18:50 <dd0a13f37> It shouldn't be Turing-complete
18:53 <JAA> Possible
18:53 <JAA> I don't really know enough about JSFuck to say for sure.
18:57 *** arkhive has joined #archiveteam-bs
18:57 <dd0a13f37> https://esolangs.org/wiki/JSFuck
18:58 <dd0a13f37> it needs a big blob which is not possible to encode in under a certain number of characters; it's ugly as fuck, but it should be safe
18:59 <dd0a13f37> the eval blob is 831 characters, so if you set an upper limit at 200 you should be fine
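
[Editor's note] The guard dd0a13f37 is sketching might look like this in Python. The 200-character cutoff is the one proposed above; whether a charset-plus-length check like this is actually airtight is precisely the open question in this conversation, so treat it as an illustration rather than a vetted sanitiser:

    import re

    # Only the characters the short numeric CF expressions are built from.
    ALLOWED = re.compile(r'^[+\-*/=!()\[\]]+$')

    def probably_harmless(expr: str) -> bool:
        # Reject anything long enough to smuggle in a JSFuck eval-style blob
        # (reportedly ~831 chars at minimum) or using any other characters.
        return len(expr) <= 200 and bool(ALLOWED.match(expr))
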
19:02 *** etudier has joined #archiveteam-bs
19:06 *** etudier has quit IRC (Client Quit)
19:06 *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
19:07 <mundus> What's the best tool for large site archival?
19:07 *** arkhive has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
19:22 <JAA> mundus: Define "large"?
19:23 <mundus> like a million pages
19:23 <JAA> wpull can handle that easily, assuming you have sufficient disk space.
19:23 <mundus> Okay
19:23 <mundus> I was guessing wpull
19:24 <JAA> Not sure if it's the "best" tool, but it works well.
19:24 <JAA> I've run multi-million URL archivals with wpull several times.
19:24 <mundus> alright, what options do you normally use?
19:25 <JAA> I think I mostly copied those used in ArchiveBot, then adapted them a bit in some cases.
19:26 <JAA> https://github.com/ArchiveTeam/ArchiveBot/blob/a6e6da8ba37e733e4b10b7090b5fc4a6cffc9119/pipeline/archivebot/seesaw/wpull.py#L18-L53
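
[Editor's note] For readers wanting a concrete starting point, a stripped-down wpull invocation in that spirit could look like the following. These are common wget/wpull-style flags rather than the exact ArchiveBot configuration from the link above, and the URL and file names are placeholders:

    wpull 'https://example.com/' \
        --recursive --level inf \
        --page-requisites \
        --span-hosts-allow page-requisites,linked-pages \
        --warc-file example.com-crawl \
        --database example.com-crawl.db \
        --wait 0.5 --tries 3

Here --warc-file writes the crawl into a WARC, and --database keeps crawl state on disk so a multi-million-URL job can be interrupted and resumed.
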
19:26 <mundus> cool, thanks
19:35 <joepie91_> mundus: you may find grab-site useful also
19:35 <joepie91_> sort of like a local ArchiveBot
19:35 <joepie91_> mundus: ref https://github.com/ludios/grab-site
19:36 <mundus> oh nice
19:47 <second> chfoo: do you have a doc explaining how wpull works with youtube-dl etc, or how it should work?
19:55 <second> How do I become a member of the ArchiveTeam and what would that mean?
19:58 <second> JAA: is there a doc somewhere on how the IA archives things and keeps backups?
19:59 *** etudier has joined #archiveteam-bs
20:02 *** BartoCH has quit IRC (Ping timeout: 260 seconds)
20:04 <JAA> second: You become a member by doing stuff that aligns with AT's activities. There isn't anything formal.
20:06 <JAA> There is some stuff in the "help" section of archive.org, and also some blog entries. Not sure what else exists.
20:07 <JAA> I don't think the individual archival strategies etc. are documented well (publicly) though.
20:12 *** BartoCH has joined #archiveteam-bs
20:21 <jrwr> second: anyone can do /something/, we are more of a method than anything, what do you want to do?
20:26 *** kristian_ has quit IRC (Remote host closed the connection)
20:26 <second> not sure, I'm more working on file categorization / curation right now
20:27 <second> What kind of things shouldn't we archive?
20:28 <jrwr> Well
20:28 <jrwr> That's a hard question
20:29 <jrwr> If you are doing web archival, I would make sure to save everything as WARCs
20:29 <jrwr> (wget supports this, so does wpull)
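
[Editor's note] A minimal example of WARC output with stock GNU wget; the URL and WARC prefix are placeholders. --warc-file enables WARC writing and --warc-cdx adds a CDX index alongside it:

    wget --recursive --level=inf --page-requisites \
         --warc-file=example-site --warc-cdx \
         'https://example.com/'
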
20:30 <jrwr> Anything else, just do the best quality you can. The more metadata, the better
20:30 <jrwr> make an account on IA and go to town uploading things
20:31 <jrwr> check out SketchCow's IA and see how he uploads things
20:31 <jrwr> (for things like CDs, tapes, paper)
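
[Editor's note] One concrete way to do such uploads is the ia command-line tool that ships with the internetarchive Python package; the identifier, file, and metadata below are made-up examples:

    # pip install internetarchive && ia configure
    ia upload example-software-cd-1998 cd-image.iso \
        --metadata="mediatype:software" \
        --metadata="title:Example Software CD (1998)"
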
20:58 *** DFJustin has quit IRC (Remote host closed the connection)
21:08 *** DFJustin has joined #archiveteam-bs
21:08 *** swebb sets mode: +o DFJustin
21:25 *** ZexaronS has quit IRC (Quit: Leaving)
22:20 *** drumstick has joined #archiveteam-bs
22:25 *** Honno has quit IRC (Read error: Operation timed out)
22:30 *** Soni has quit IRC (Ping timeout: 506 seconds)
22:41 *** Soni has joined #archiveteam-bs
22:41 <second> Does the Internet Archive have deduplication active?
22:41 <second> I wouldn't want to upload a bunch of stuff and waste their space
22:41 *** ZexaronS has joined #archiveteam-bs
22:43 <second> JAA: has this been archived? https://www.reddit.com/r/opendirectories/comments/6zuk7v/alexandria_library_38029_ebooks_from_5268_author/
22:44 <second> https://alexandria-library.space/Ebooks/Author/
22:44 <second> https://alexandria-library.space/Ebooks/ComputerScience/
22:44 <second> https://alexandria-library.space/Images/ww2/north-american-aviation-world-war-2/
22:44 <second> https://alexandria-library.space/Images/
22:45 <JAA> Not yet, as far as I know, but arkiver just added them to ArchiveBot.
22:46 <arkiver> yeah
22:47 *** BartoCH has quit IRC (Quit: WeeChat 1.9)
22:50 <second> Did you do it because I said something or was it already added? I'm wondering if you guys watch that and other subreddits
22:51 <second> Is there an archive of Sci-Hub?
22:53 <JAA> I watch some subreddits, but not opendirectories (yet).
22:53 <arkiver> added because you said it
22:53 <arkiver> it looks like something we want to archive
22:54 <JAA> We were discussing libgen several times in the past few days. See the logs: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-17,Sun
22:55 <JAA> Basically, at this point, I assume that IA has a darked copy of it, and even if they don't, the dataset won't disappear anytime soon and can still be archived *if* libgen actually gets in trouble.
22:59 <second> Isn't libgen always possibly in trouble?
22:59 <second> Different governments / institutions trying to shut it down
22:59 <second> JAA: are you Jason Scott?
22:59 <JAA> Possible, but I wouldn't be worried about the data until libgen actually goes offline or similar.
23:00 <JAA> The data is available in (active) torrents and on Usenet...
23:00 <JAA> No, that's SketchCow.
23:01 <second> How does one set up a Usenet account / get one, is there a guide somewhere?
23:01 <JAA> First rule of Usenet...
23:02 <second> Dammit
23:02 <JAA> :-P
23:02 <JAA> Check out /r/usenet. They have a ton of good information.
23:03 <second> Will you guys archive porn?
23:03 <JAA> Well, we did archive Eroshare, so there's that.
23:04 *** Soni has quit IRC (Read error: Connection reset by peer)
23:04 <JAA> There's also that 2 PB webcam archive by /u/Beaston02.
23:04 <second> Eh, I found a wiki which lists actors in porn, but you need to log in
23:04 <JAA> That's not on IA though.
23:05 <second> Can you archive it?
23:05 <second> Why not?
23:05 <second> All this stuff on the IA, and the most viewed stuff in the art museum is vintage porn
23:05 <second> http://95.31.3.127/pbc/Main_Page
23:06 <JAA> Well, I don't think IA is interested in spending 3-4 million dollars over the next few years for random porn webcams.
23:09 <JAA> (That number is based on https://twitter.com/textfiles/status/885527796583284741 )
23:11 <second> How do people archive 2 PB of data?!
23:11 <JAA> I'm not saying it shouldn't be archived. In general, my opinion is that everything should be kept. Unfortunately though, that's not very realistic, and I think there are more important things to preserve than random porn webcams.
23:12 <JAA> Amazon Cloud Drive and now Google Drive.
23:12 <second> Wait a minute, Jason Scott is the same guy behind textfiles.com, interesting
23:12 <JAA> Some people suspect that ACD only killed the unlimited offer because of Beaston02 storing those webcam recordings there.
23:13 <second> JAA: are there any upcoming storage breakthroughs that you can think of?
23:13 <second> Lol, "this is why we can't have nice things"
23:14 *** ld1 has quit IRC (Read error: Connection reset by peer)
23:17 <JAA> No idea really. HAMR will come, but that probably won't really reduce storage costs massively, i.e. not a real breakthrough. DNA storage is still far away, I guess. Otherwise, I don't really know too much about other technologies currently in development.
23:20 *** ld1 has joined #archiveteam-bs
23:32 *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
23:35 *** etudier has joined #archiveteam-bs
23:38 <jrwr> I think DNA might be a good ROM
23:38 <jrwr> not WMRM
23:41 <jrwr> or like old school tape drives
23:49 <JAA> Yeah, it sounds pretty perfect for long-term archival.