#archiveteam-bs 2018-09-18,Tue

Time	Nickname	Message
00:40 ^🔗		bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
00:58 ^🔗		AlbardinG is now known as flashback
01:03 ^🔗		flashback is now known as Flashback
01:18 ^🔗		omglolbah has quit IRC (Read error: Operation timed out)
01:23 ^🔗		omglolbah has joined #archiveteam-bs
01:40 ^🔗	adinbied	moufu - thanks so much! That worked! Also, no problem schbirid! I'm now running into an issue where wget isn't grabbing the photos and other content on the page, but just the html. I'm sure I'm missing something obvious, but I can't figure out what it is
01:51 ^🔗		BartoCH has quit IRC (Ping timeout: 615 seconds)
01:51 ^🔗		BartoCH has joined #archiveteam-bs
01:52 ^🔗		BlueMaxim has joined #archiveteam-bs
01:52 ^🔗		BlueMaxim has quit IRC (Read error: Connection reset by peer)
01:53 ^🔗		BlueMaxim has joined #archiveteam-bs
01:55 ^🔗		BlueMax has quit IRC (Ping timeout: 633 seconds)
02:01 ^🔗		albardin has joined #archiveteam-bs
02:10 ^🔗	w0rmhole	godane: nice! how'd you do it? did you use a really fancy scanner? were you able to keep the magazines intact (aka w/o cutting the pages out)?
02:11 ^🔗	w0rmhole	last question, what's the source format? i want to download it in the best quality possible
02:12 ^🔗	Flashfire	JPEG, definitely JPEG
02:12 ^🔗	Flashfire	and .txt files
02:14 ^🔗	Flashfire	W0rmhole I know all of the file types
02:15 ^🔗	w0rmhole	-_-
02:15 ^🔗	w0rmhole	you mean .bmp?
02:15 ^🔗	w0rmhole	16 colour
02:15 ^🔗	Flashfire	Yes thats what I meant
02:16 ^🔗	Flashfire	and XBM files
02:50 ^🔗	godane	i use JPEG with 90% compression
02:51 ^🔗	godane	i do have a fancy scanner and the books are kept intact
02:51 ^🔗	godane	its a plustek opticbook 4800
02:52 ^🔗	godane	what's funny is that i had the scanner given to me in early 2013 by Jason
02:52 ^🔗	godane	i think i only scanned 17 things before, cause i had to go to windows to use it since there is no linux driver
02:54 ^🔗	godane	best part is it was best scanner of 2011 on a linux site: linux.sys-con.com/node/2068241
02:55 ^🔗	godane	in case you can view the website: https://web.archive.org/web/20131201213438/linux.sys-con.com/node/2068241
02:55 ^🔗	godane	*can't view the website
02:57 ^🔗	godane	a part of me thought that was always weird cause if there is no linux drivers then how is it the best scanner of the year on a linux site :P
03:20 ^🔗	godane	SketchCow: btw i need those return labels so i buy more tapes on ebay using my patreon money
03:20 ^🔗	godane	send more than 12 labels also please
03:21 ^🔗	godane	cause i want to send ALL of the boxes and then some afterwards
03:23 ^🔗	godane	also if possible put return labels in future boxes so we don't have this problem again
06:07 ^🔗		Mateon1 has quit IRC (Ping timeout: 268 seconds)
06:07 ^🔗		Mateon1 has joined #archiveteam-bs
06:11 ^🔗		omglolbah has quit IRC (Ping timeout: 268 seconds)
06:13 ^🔗		svchfoo1 has quit IRC (Ping timeout: 268 seconds)
06:14 ^🔗		svchfoo1 has joined #archiveteam-bs
06:15 ^🔗		svchfoo3 sets mode: +o svchfoo1
06:15 ^🔗		omglolbah has joined #archiveteam-bs
06:46 ^🔗		Stilett0 has joined #archiveteam-bs
07:02 ^🔗		svchfoo1 has quit IRC (Ping timeout: 268 seconds)
07:04 ^🔗		kiskabak has quit IRC (se.hub efnet.portlane.se)
07:04 ^🔗		Kaz has quit IRC (se.hub efnet.portlane.se)
07:09 ^🔗		svchfoo1 has joined #archiveteam-bs
07:09 ^🔗		kiskabak has joined #archiveteam-bs
07:09 ^🔗		Kaz has joined #archiveteam-bs
07:09 ^🔗		svchfoo3 sets mode: +o svchfoo1
07:10 ^🔗		t2t2 has quit IRC (Read error: Operation timed out)
07:11 ^🔗		t2t2 has joined #archiveteam-bs
08:48 ^🔗		t2t2 has quit IRC (Quit: No Ping reply in 210 seconds.)
08:49 ^🔗		BartoCH has quit IRC (Quit: WeeChat 2.2)
08:50 ^🔗		BartoCH has joined #archiveteam-bs
08:50 ^🔗		t2t2 has joined #archiveteam-bs
08:50 ^🔗		Stilett0 is now known as Stiletto
09:25 ^🔗		antomatic has joined #archiveteam-bs
09:25 ^🔗		swebb sets mode: +o antomatic
09:27 ^🔗		antomati_ has quit IRC (Read error: Operation timed out)
10:35 ^🔗		wp494 has quit IRC (Ping timeout: 492 seconds)
10:37 ^🔗		wp494 has joined #archiveteam-bs
10:38 ^🔗		icedice has joined #archiveteam-bs
11:11 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
12:47 ^🔗		icedice has quit IRC (Quit: Leaving)
13:10 ^🔗		bitBaron has joined #archiveteam-bs
13:51 ^🔗		Nicu has joined #archiveteam-bs
13:53 ^🔗	Nicu	How about attempt archiving Tripod.Lycos?
13:55 ^🔗	PurpleSym	Nicu: We do accept donations: https://opencollective.com/archiveteam
13:56 ^🔗	kiska	Nicu How would we discover hosted services?
13:58 ^🔗		eientei95 has joined #archiveteam-bs
13:58 ^🔗	Nicu	manually in case it's not yet ready for automatic work?
13:59 ^🔗	kiska	Anyway eientei95 is here from your previous discussion, continue
14:00 ^🔗	Nicu	so i think you could expand
14:00 ^🔗	Nicu	it's a great cause
14:01 ^🔗	Nicu	but there's not much info
14:01 ^🔗	Nicu	about what has been done
14:01 ^🔗	Nicu	or it's not readily available
14:02 ^🔗	Nicu	also, we understand how valuable is the information stored but what could one do with it in the future?
14:04 ^🔗	kiska	Unfortunately I can't find any reference to any sites stored on tripod.lycos through their robots.txt and sitemap
14:04 ^🔗	kiska	Any site crawl we do ends up on the Wayback Machine
14:06 ^🔗	Nicu	tripod was the homologue of GeoCities right
14:06 ^🔗	Nicu	doesn't it make sense to occupy that too?
14:06 ^🔗	Nicu	that's where i'm coming from
14:07 ^🔗	JAA	kiska: There's http://members.tripod.com/robots.txt . I'm not really sure what the relation between tripod.com and tripod.lycos.com is.
14:08 ^🔗	JAA	Nicu: We usually only go after sites when they are shutting down or removing content or in immediate risk. I agree though that it would be nice to grab Tripod in its entirety.
14:09 ^🔗	kiska	I guess we could chuck a few at a time into #archivebot
14:09 ^🔗	kiska	And we might as well grab angelfire as well since they're the same company (I think)
14:10 ^🔗	JAA	Also, I went through the IRC logs. While Tripod has been brought up several times, I don't see anything regarding a systematic archival.
14:10 ^🔗	JAA	Angelfire is in progress over at #angelonfire .
14:10 ^🔗	Nicu	that's good. If I help to copy how do I know that my work will be safe and useful in the future?
14:10 ^🔗	JAA	s/in progress/being worked on/
14:11 ^🔗	Nicu	is it stored in a data center
14:11 ^🔗	JAA	It is stored at the Internet Archive.
14:11 ^🔗	Nicu	that will "work for human kind" in the future?
14:11 ^🔗	Nicu	do you have an agreement or is it an open kind of thing
14:11 ^🔗	JAA	IA has their own DC in San Francisco.
14:11 ^🔗	JAA	There are several people from IA in ArchiveTeam.
14:12 ^🔗	Nicu	ok
14:12 ^🔗	JAA	(Well, I know of two, so not sure if that counts as "several".)
14:13 ^🔗	JAA	We also have an independent project to mirror the most important content from IA in a distributed manner. Check out IA.BAK on our wiki. (I have no idea how active that project is nowadays though.)
14:13 ^🔗	JAA	However, IA stores around 45 PB of unique data currently, so mirroring it all is expensive.
14:14 ^🔗	kiska	I know kiskaJDC has ~900GB of a shard
14:16 ^🔗	Nicu	can i get a pass for wiki?
14:16 ^🔗	jrwr	also Nicu we have been around a good while. If you want to archive something and a project is stalled or not even started, learn how our pipelines work and submit it to be archived as a warrior project
14:16 ^🔗	Nicu	WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
14:16 ^🔗	jrwr	that's the resource we are the most limited on
14:17 ^🔗	jrwr	programmers and engineers
14:17 ^🔗	JAA	Nicu: What do you want to do on the wiki?
14:17 ^🔗	Nicu	expand the mindset for archiving the Internet
14:17 ^🔗	jrwr	How so?
14:18 ^🔗	jrwr	Lets discuss it
14:18 ^🔗	Nicu	i'm nostalgic for how creative it used to be
14:18 ^🔗	Nicu	homepages that expressed individuality
14:18 ^🔗	Nicu	web design that was unexplainably beautiful
14:18 ^🔗	Nicu	perhaps there might be a way to preserve this
14:18 ^🔗	jrwr	How do you see us expanding the mindset for archiving the internet
14:19 ^🔗	Nicu	also the names for irc channels that were beautiful (esp. on Undernet)
14:19 ^🔗	Nicu	perhaps archive the IRC logs
14:19 ^🔗	JAA	Our mindset is already "archive ALL the things". I don't think there's much to expand there. :-)
14:19 ^🔗	jrwr	lol
14:19 ^🔗	eientei95	lol
14:19 ^🔗	Nicu	but you don't do anything in this direction
14:19 ^🔗	Nicu	how to categorize
14:19 ^🔗	Nicu	web design under ALL?
14:19 ^🔗	jrwr	JAA: how much per month do we shove into archivebot?
14:19 ^🔗	jrwr	1-2TB?
14:20 ^🔗	eientei95	IRC operators tend to not like channels being idled in for logs
14:20 ^🔗	Igloo	Way more jrwr
14:20 ^🔗	Igloo	More like 20-30+ some months
14:20 ^🔗	jrwr	Ya
14:20 ^🔗	JAA	jrwr: We uploaded 23.7 TiB in August.
14:20 ^🔗	JAA	485 TiB in total since ArchiveBot was started.
14:20 ^🔗	jrwr	Nicu: even now: Job status: 95499 completed, 2931 aborted, 567 failed, 78 in progress, 23 pending
14:20 ^🔗	Nicu	how to access it?
14:21 ^🔗	jrwr	all gets uploaded to the wayback machine
14:21 ^🔗	jrwr	and the internet archive as WARCs
14:21 ^🔗	jrwr	so normal people can access it with the wayback machine
14:21 ^🔗	Nicu	sounds like trash and not info that could be instructable
14:21 ^🔗	JAA	https://archive.org/details/archivebot if you want the raw data. https://web.archive.org/ if you want to browse it.
14:22 ^🔗	Igloo	Aug '18 3.09 TiB / 2.57 TiB / 5.66 TiB
14:22 ^🔗	jrwr	Well, thats the thing, how do you present the billions of webpages we have saved
14:22 ^🔗	kiska	eg https://web.archive.org/web/20180906155341/https://oldforums.eveonline.com/ Archivebot did this about 2 weeks ago. And you can see it by clicking on "About this capture"
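A minimal sketch (not from the channel) of how the capture time is encoded in a Wayback Machine URL like the one kiska linked: the 14-digit path segment is a timestamp in YYYYMMDDhhmmss form, assumed here to be UTC. The function name `parse_wayback_url` is hypothetical.

```python
import re
from datetime import datetime, timezone

# 14-digit timestamp, an optional two-letter modifier such as "id_", then the original URL.
_WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$"
)


def parse_wayback_url(url):
    """Split a Wayback Machine URL into (capture time, original URL)."""
    match = _WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a timestamped Wayback Machine URL: %r" % url)
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    # Assumption: Wayback timestamps are in UTC.
    return captured.replace(tzinfo=timezone.utc), match.group(2)


captured, original = parse_wayback_url(
    "https://web.archive.org/web/20180906155341/https://oldforums.eveonline.com/"
)
# Age of the capture relative to the date of this log (2018-09-18).
age = datetime(2018, 9, 18, tzinfo=timezone.utc) - captured
print(captured.isoformat(), original, age.days)
```

The capture is a bit under two weeks older than this log, which matches the "about 2 weeks ago" in the message.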
14:22 ^🔗	jrwr	A eve nerd! :)
14:22 ^🔗	JAA	Nicu: Feel free to download the CDXes from the ArchiveBot collection and build a nice interface for it. Hint, it'll be huge.
14:22 ^🔗	jrwr	Ya, the wayback does allow for searching
14:23 ^🔗	JAA	Yeah, but only page titles unfortunately.
14:23 ^🔗	jrwr	but browsing is a little harder, like the old webrings (that still work!)
14:23 ^🔗	kiska	jrwr xD And I think that upload was due to CCP being bought out by Pearl Abyss
14:23 ^🔗	jrwr	lol
14:23 ^🔗	jrwr	there are dozens of us, DOZENS
14:23 ^🔗	kiska	So I chucked it into the bot, it landed on my SSD pipeline, therefore it ran out of space
14:24 ^🔗	jrwr	Nicu: thats the hardest thing, there is a metric ton of data, stored in open formats waiting for someone to do something with it
14:24 ^🔗	eientei95	I'm still on the tutorial
14:24 ^🔗	jrwr	We are here to make sure that data even exists
14:24 ^🔗	eientei95	Archive now, make pretty later
14:24 ^🔗	jrwr	eientei95: Play with others, join Pandemic Horde or any of the other newbie alliances
14:24 ^🔗	jrwr	you will love eve then
14:25 ^🔗	eientei95	If we make pretty now, archiving will be held back
14:25 ^🔗	jrwr	Correct
14:25 ^🔗	kiska	I played for just about 8 years I think
14:25 ^🔗	eientei95	jrwr: Will do, going to get back to actually playing games
14:25 ^🔗	JAA	Take the Eve discussion to -ot please.
14:25 ^🔗	jrwr	and we are open to any project/website to be archived, we have archivebot for the smaller jobs
14:25 ^🔗	kiska	But yes make pretty now will make archival very slow
14:25 ^🔗	jrwr	and anyone can make a pipeline / wget-lua code to archive a site when required
14:26 ^🔗	kiska	Especially sites that use a ton of Javascript
14:26 ^🔗	jrwr	Nicu: any thoughts so far?
14:26 ^🔗	eientei95	i,e. Modern websites
14:26 ^🔗	Nicu	i'm thinking of my 2TB of free space
14:26 ^🔗	Nicu	and that they can't be useful
14:26 ^🔗	Nicu	since it's stored in IA?
14:26 ^🔗	eientei95	"can't be useful"
14:26 ^🔗	eientei95	[02:25:56] <jrwr> and anyone can make a pipeline / wget-lua code to archive a site when required
14:26 ^🔗	Nicu	or maybe use the IA.BAK thing?
14:27 ^🔗	jrwr	Run a warrior, run a dozen of them, docker is great for this
14:27 ^🔗	JAA	eientei95: Warrior instances don't use much disk space and normally don't keep all the data anyway.
14:27 ^🔗	jrwr	your part helps with archiving sites when the need arises
14:27 ^🔗	Nicu	i feel like i am losing the motivation since it's like grabbing what you can
14:27 ^🔗	jrwr	Can you program Nicu
14:28 ^🔗	JAA	Yep, that. And you can join IA.BAK also if you want.
14:28 ^🔗	eientei95	JAA: Right, I've had a few instances where Warrior failed due to lack of disk space
14:28 ^🔗	jrwr	IA.BAK is pretty good
14:28 ^🔗	jrwr	always need more diskspace for that
14:28 ^🔗	Nicu	the urge is now, for all the good things social networks and telegram destroy like hurricanes
14:28 ^🔗	JAA	eientei95: Oh, that can certainly happen. But it doesn't need 2 TB of disk space.
14:28 ^🔗	Nicu	and ArchiveTeam "just shoves" :-D
14:29 ^🔗	JAA	What would you suggest we do? Ignore the sites that are shutting down all the time, and let their data be lost forever?
14:29 ^🔗	Nicu	i am not suggesting, just saying this is too generalistic
14:29 ^🔗	Nicu	am i wrong?
14:30 ^🔗	jrwr	Its all we can do, there is so much to save
14:30 ^🔗	eientei95	WE are all but a drop in the ocean
14:30 ^🔗	jrwr	^^
14:30 ^🔗	Nicu	doesn't it matter WHAT we save
14:30 ^🔗	Igloo	We save what we can. As much as we can
14:30 ^🔗	Nicu	i don't want to interrupt the good work though
14:30 ^🔗	Nicu	just putting it to debate
14:30 ^🔗	eientei95	Start up a pipeline, shove sites you want archived into it
14:31 ^🔗	jrwr	Also, Anyone can upload to the internet archive
14:31 ^🔗	Igloo	Not anymore IIRC
14:31 ^🔗	eientei95	^
14:31 ^🔗	Igloo	I think only some stuff goes into the way back
14:31 ^🔗	jrwr	ya
14:31 ^🔗	JAA	Anyone can upload to IA. But it only goes into the WBM if you're whitelisted.
14:31 ^🔗	JAA	That doesn't mean that uploads that don't go into the WBM aren't useful.
14:31 ^🔗	jrwr	but if you make good WARCs (built into wget ) save anything and everything you can, add good metadata for people to find it
14:32 ^🔗	Nicu	breaking my dreams of seeing Internet not just a dump like now :-D
14:32 ^🔗	jrwr	the wayback machine is the best method to relive that old internet
14:32 ^🔗	jrwr	we are doing everything we can to give it as much data as we can
14:34 ^🔗	jrwr	If you want to archive websites we are not covering, go do it then, all the knowledge we have is in the public, our wiki, our github
14:35 ^🔗	Nicu	i know logic tells you it's correct to run this machine of endless copying of who knows what, and the heart and soul tell you I'm right, but you don't want to take this into consideration
14:35 ^🔗		dxrt- has joined #archiveteam-bs
14:35 ^🔗		dxrt has quit IRC (Ping timeout: 252 seconds)
14:35 ^🔗	jrwr	Ok, you are not making any sense Nicu
14:36 ^🔗	jrwr	Right now, What do you want us to do
14:36 ^🔗	jrwr	Dumb it down for me
14:37 ^🔗	Nicu	go to the darkside
14:37 ^🔗		hook54321 has quit IRC (Ping timeout: 252 seconds)
14:37 ^🔗		i0npulse has quit IRC (Ping timeout: 252 seconds)
14:38 ^🔗	jrwr	Thats not even remotely helpful,
14:38 ^🔗	jrwr	I'm ignoring you now
14:38 ^🔗	adinbied	Completely off-topic from the discussion going on, but does anyone know why my WGetDownload isn't grabbing images?
14:39 ^🔗	JAA	Content gets lost all the time because websites shut down. We try to save what we can. And you're saying this isn't useful?
14:39 ^🔗	JAA	Also, it's not ours to decide which information should be preserved. We don't know what will be useful for future historians. So "archive ALL the things".
14:40 ^🔗	adinbied	https://github.com/adinbied/angelfire-grab/blob/master/pipeline.py#L162 is my WGetArgs & for some reason it's only grabbing the HTML & not all of the resources (gifs, pngs, jpgs, embedded content)
14:40 ^🔗	adinbied	https://www.archiveteam.org/images/c/ce/Archive-all-the-things-thumb.jpg
14:41 ^🔗		Nicu has quit IRC ()
14:41 ^🔗	JAA	adinbied: The problem is your download_child_p hook. It always returns false, meaning everything should be skipped.
14:42 ^🔗	JAA	I think the initial URLs passed on the command line are exempt from this filtering. And so that's all it grabs.
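A minimal sketch of the behaviour JAA describes, written in Python purely as an illustration (the real `download_child_p` hook is a Lua callback in wget-lua). It assumes, as JAA suspects, that the URLs given on the command line bypass the hook. The names `always_false`, `defer_to_wget` and `simulate_grab` and the `.invalid` URLs are hypothetical.

```python
def always_false(url, verdict):
    """Mirrors the bug: every discovered link is rejected."""
    return False


def defer_to_wget(url, verdict):
    """Keeps whatever wget itself had already decided to download."""
    return verdict


def simulate_grab(seed_urls, discovered, hook):
    """Return the URLs a crawl would fetch for a given hook.

    Assumption: seed URLs are fetched without consulting the hook;
    every discovered (url, verdict) pair is passed through it.
    """
    grabbed = list(seed_urls)
    for url, verdict in discovered:
        if hook(url, verdict):
            grabbed.append(url)
    return grabbed


seeds = ["http://example.invalid/page.html"]
links = [
    ("http://example.invalid/photo.jpg", True),   # page requisite, wget says yes
    ("http://example.invalid/style.css", True),   # page requisite, wget says yes
    ("http://other.invalid/", False),             # off-site link, wget says no
]

print(simulate_grab(seeds, links, always_false))   # only the HTML seed
print(simulate_grab(seeds, links, defer_to_wget))  # seed plus its requisites
```

With the always-false hook only the seed page survives, which matches the "just the html" symptom reported earlier in the log.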
14:42 ^🔗	adinbied	Ah, derp. I really need to learn more Lua - that makes sense. Thanks!
14:48 ^🔗	adinbied	Hmmm.. Getting rid of that still doesn't seem to work....
14:49 ^🔗		pikhq has quit IRC (se.hub irc.underworld.no)
14:49 ^🔗		kiska has quit IRC (se.hub irc.underworld.no)
14:49 ^🔗		Flashfire has quit IRC (se.hub irc.underworld.no)
14:49 ^🔗		w0rmhole has quit IRC (se.hub irc.underworld.no)
14:50 ^🔗		pikhq_ has joined #archiveteam-bs
14:52 ^🔗		kiska has joined #archiveteam-bs
14:52 ^🔗		hook54321 has joined #archiveteam-bs
14:52 ^🔗		w0rmhole has joined #archiveteam-bs
14:54 ^🔗		i0npulse has joined #archiveteam-bs
14:54 ^🔗		Flashfire has joined #archiveteam-bs
15:08 ^🔗	jrwr	JAA: so I had an interesting idea
15:08 ^🔗	jrwr	IPFS content mirroring to the IA, cache all the content we can (its easy to discover random content) and upload it to the IA or have a IA box pin content on the IPFS to save it
15:10 ^🔗	JAA	Yeah, the latter would probably be the easiest, just get IA to pin the content and it'll live forever.
15:10 ^🔗	jrwr	https://github.com/ipfs-search/ipfs-search
15:10 ^🔗	jrwr	looks like you can sniff the DHT traffic to find content
15:11 ^🔗	jrwr	since content that doesn't get pinned will get purged after some time
15:11 ^🔗	jrwr	or not accessed
15:22 ^🔗	JAA	Is there any estimate how big IPFS is?
15:27 ^🔗	jrwr	~Lots~
15:27 ^🔗	jrwr	https://github.com/victorbjelkholm/ipfscrape
15:27 ^🔗	jrwr	JAA: interesting project
15:27 ^🔗	jrwr	saves webpages and stores them into ipfs
15:32 ^🔗	JAA	Interesting indeed.
15:38 ^🔗	jrwr	like right now
15:39 ^🔗	jrwr	im shoving my entire dataset (about 20k files) into IPFS
15:39 ^🔗	jrwr	so my users can use it, since they like having the entire dataset at times
15:47 ^🔗		Jon has quit IRC (Read error: Operation timed out)
15:50 ^🔗		jmtd has joined #archiveteam-bs
16:22 ^🔗		icedice has joined #archiveteam-bs
16:25 ^🔗	jrwr	wonder who we poke to do something like that :)
16:25 ^🔗	adinbied	arkiver, I'm sure you are insanely busy (I know life gets in the way) - whenever you can, would you be able to get the Angelfire project setup (ie getting the tracker setup and the github repo initialized) and looking over the quizlet target and giving the OK to proceed?
16:45 ^🔗	astrid	that joker "hello_" /msg'd me and demanded pictures
16:45 ^🔗	astrid	buddy, i grew up online and my girlfriend does porn. this isn't my first rodeo.
16:52 ^🔗	adinbied	OK, so it seems my issue is in the Lua - how do I specify: if the string matches .jpg, .png or .gif, then add it to the url queue?
16:54 ^🔗	adinbied	In the wget callbacks get urls function I need to specify in broad general terms that if an image is found in the HTML of the page, then to add it to grab
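A minimal sketch of the matching logic adinbied is asking about, in Python rather than Lua and only as an illustration (in the actual grab this would live in the wget-lua get urls callback). It scans a page's HTML for attributes that point at image files and returns absolute URLs to queue. The names `ImageLinkParser` and `find_image_urls`, the extension list, and the `.invalid` URLs are assumptions for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Extensions treated as images; extend as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class ImageLinkParser(HTMLParser):
    """Collects absolute URLs of image-like resources referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href", "background") or not value:
                continue
            absolute = urljoin(self.base_url, value)
            # Compare the path only, so query strings and case do not matter.
            path = urlsplit(absolute).path.lower()
            if path.endswith(IMAGE_EXTENSIONS) and absolute not in self.found:
                self.found.append(absolute)


def find_image_urls(html, base_url):
    """Return image URLs found in html, resolved against base_url."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.found


sample = (
    '<body background="bg.gif">'
    '<img src="/images/cat.JPG?x=1">'
    '<a href="about.html">About</a>'
    '<img src="logo.png">'
    "</body>"
)
image_urls = find_image_urls(sample, "http://www.example.invalid/user/index.html")
print(image_urls)
```

Matching on the parsed path instead of the raw string avoids missing URLs that carry a query string or an upper-case extension.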
16:56 ^🔗		Dimtree has joined #archiveteam-bs
17:28 ^🔗		Dimtree has quit IRC (Peace)
17:44 ^🔗		Dimtree has joined #archiveteam-bs
17:52 ^🔗		Dimtree has quit IRC (Quit: Peace)
17:54 ^🔗		icedice has quit IRC (Quit: Leaving)
17:57 ^🔗		Dimtree has joined #archiveteam-bs
18:15 ^🔗		jmtd has quit IRC (Ping timeout: 252 seconds)
18:15 ^🔗		i0npulse has quit IRC (Ping timeout: 252 seconds)
18:15 ^🔗		w0rmhole has quit IRC (Ping timeout: 252 seconds)
18:16 ^🔗		Jon- has joined #archiveteam-bs
18:16 ^🔗		Flashfire has quit IRC (Ping timeout: 252 seconds)
18:16 ^🔗		hook54321 has quit IRC (Ping timeout: 252 seconds)
18:18 ^🔗		kiska has quit IRC (Ping timeout: 252 seconds)
18:43 ^🔗		i0npulse has joined #archiveteam-bs
18:46 ^🔗		hook54321 has joined #archiveteam-bs
20:39 ^🔗		Lord_Nigh has quit IRC (Quit: ZNC - http://znc.in)
20:41 ^🔗		Lord_Nigh has joined #archiveteam-bs
21:09 ^🔗		ColdIce has joined #archiveteam-bs
21:13 ^🔗		bitBaron has quit IRC (Read error: Connection reset by peer)
21:14 ^🔗		bitBaron has joined #archiveteam-bs
21:26 ^🔗		bitBaron has quit IRC (My computer has gone to sleep. 😴😪ZZZzzz…)
21:27 ^🔗		bitBaron has joined #archiveteam-bs
21:35 ^🔗		Flashfire has joined #archiveteam-bs
23:50 ^🔗		bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
