00:50 -- Honno has joined #archiveteam-bs
01:17 -- Mateon1 has quit IRC (Read error: Operation timed out)
01:30 -- drumstick has quit IRC (Ping timeout: 360 seconds)
01:30 -- drumstick has joined #archiveteam-bs
02:00 -- Honno has quit IRC (Read error: Operation timed out)
02:16 <hook54321> I've been recording a camera feed from a bar in Catalonia for hours... but I'm still on the fence about whether I should upload it to archive.org or not.
02:24 -- schbirid has quit IRC (Ping timeout: 255 seconds)
02:36 -- schbirid has joined #archiveteam-bs
02:37 -- pizzaiolo has joined #archiveteam-bs
03:30 <godane> so this guy has an interesting set of VHS tapes: https://www.ebay.com/sch/VHS-Video-Tapes/149960/m.html?_nkw=&_armrs=1&_ipg=&_from=&_ssn=froggyholler
03:30 <godane> i'm only interested in one tape that said Dog Day Afternoon, with a time of 1h 57m
03:30 <godane> on TB
03:31 <godane> *TBS
03:35 <hook54321> I wonder what would happen if we set up something like a GoFundMe so we could obtain stuff like that
03:43 <godane> i do have a patreon page: https://www.patreon.com/godane
03:43 <godane> hook54321: it's how i got the last set of tapes from ebay
03:55 <godane> SketchCow: i'm uploading the original broadcast of Dinotopia, so don't upload my vhs stuff for the next day
03:56 <godane> it's about 17,272,156,160 bytes in size
04:19 -- qw3rty113 has joined #archiveteam-bs
04:22 -- Stilett0 has joined #archiveteam-bs
04:25 -- qw3rty112 has quit IRC (Read error: Operation timed out)
04:32 -- Mateon1 has joined #archiveteam-bs
04:33 <phillipsj> I saw a few episodes of that... it was OK, but not great.
04:33 * phillipsj is getting old.
04:58 -- drumstick has quit IRC (Ping timeout: 255 seconds)
04:58 -- drumstick has joined #archiveteam-bs
06:20 -- pizzaiolo has quit IRC (Ping timeout: 246 seconds)
06:41 <SketchCow> Great
08:51 <hook54321> I think I've found some Catalonia radio stations that can be listened to online, however I've been blocked from them (I'm pretty sure for trying different ports), so I can't record them. :/
09:39 -- mls has joined #archiveteam-bs
09:42 -- mls has left
09:45 <hook54321> https://www.youtube.com/watch?v=cfmiNyneO88
09:53 -- BlueMaxim has quit IRC (Quit: Leaving)
10:15 <JAA> SketchCow: Is there anything that can be done to speed up transfers to FOS? My ArchiveBot pipeline has trouble keeping up as the uploads average only 2-3 MB/s. Or should I look into uploading to IA directly instead?
10:16 -- Honno has joined #archiveteam-bs
10:44 <hook54321> If anyone is interested in joining a Discord server where there are people from Catalonia, DM me and, if I'm awake, I'll probably send you the link. I might ask you a few questions about why you want to join it.
10:50 -- Aerochrom has joined #archiveteam-bs
12:25 -- nepeat has quit IRC (ZNC 1.6.5 - http://znc.in)
12:50 -- will has joined #archiveteam-bs
13:45 -- pizzaiolo has joined #archiveteam-bs
14:19 -- Aoede has quit IRC (Ping timeout: 255 seconds)
14:20 -- Aoede has joined #archiveteam-bs
14:20 -- Aoede has quit IRC (Connection closed)
14:20 -- Aoede has joined #archiveteam-bs
14:25 -- fie has quit IRC (Quit: Leaving)
14:40 -- drumstick has quit IRC (Read error: Operation timed out)
15:13 -- pizzaiolo has quit IRC (Remote host closed the connection)
15:48 <SketchCow> JAA: Noo
15:48 <SketchCow> JAA: Also Noo
15:49 <SketchCow> FOS is slow; it's about to be replaced by another FOS
15:54 <JAA> Ok, sweet.
15:54 <JAA> Any idea when that'll happen?
16:04 <SketchCow> Soon?
16:20 -- julius_ is now known as jschwart
16:21 <jschwart> I've made a collection of discs now that I can probably send to the Internet Archive, are there any instructions online for that?
16:22 <jschwart> another thing I have is old hardware with manuals, discs, etc. Does anybody have any suggestions on what would be nice to do with that?
16:22 <jschwart> and what about software discs which came with relatively big books as manuals?
16:24 <SketchCow> Describe discs in this context.
16:26 <jschwart> I have for instance Klik & Play, an old tool to make games; it came with quite a big book
16:26 <jschwart> also Superlogo, which came with a (Dutch) book and discs
17:06 -- pizzaiolo has joined #archiveteam-bs
17:31 -- Specular has joined #archiveteam-bs
17:32 <Specular> asking again in case someone here knows: is it possible to save a full list of results for a generic 'site:' query from Google?
17:33 <dashcloud> Specular: not completely automated - Google quickly hits you with captchas and bans if you try to scrape them
17:33 <Specular> god damn it
17:33 <dashcloud> but there's a number of tools you can use in browser
17:33 <Specular> a site recently hid an entire section of its site (according to one user, around 36 million posts) and I'm not sure for how long Google will retain the results of its previous crawls
17:34 <dashcloud> apparently the thing you want to do is very popular in the SEO community, so there's a number of extensions for this, and guides on doing it
17:35 <Specular> dashcloud, what should I be entering when searching? Couldn't find much when I tried last
17:35 <Specular> (that is, to find out more about how to do this)
17:35 <dashcloud> you probably want "download google search results to csv"
17:36 <Specular> for an enormous scrape, wouldn't the size become unmanageably large to open?
17:38 <Specular> I'm not actually sure if this would cause problems, but thought I'd ask
17:38 <dashcloud> possibly - then you'd use a different tool to open the file
17:38 <dashcloud> Notepad++, a database, or possibly LibreOffice/Excel
17:39 <dashcloud> there's this Chrome extension which may or may not work: https://chrome.google.com/webstore/detail/linkclump/lfpjkncokllnfokkgpkobnkbkmelfefj?hl=en
17:40 <dashcloud> here's a promising Firefox extension: https://jurnsearch.wordpress.com/2012/01/27/how-to-extract-google-search-results-with-url-title-and-snippet-in-a-csv-file/
17:41 <dashcloud> jschwart: if you want to upload stuff yourself to the Internet Archive, I can give you some basic guidelines
17:45 <dashcloud> if you'd prefer to send stuff to IA, ask SketchCow for the mailing address
17:49 <Specular> Google seems like it might offer a way to grab this officially via a service called 'Search Console', but to get more than a certain number of results one needs the pro version. Perhaps someone on the aforementioned site has a subscription.
18:00 -- benuski has joined #archiveteam-bs
18:25 -- Specular has quit IRC (be back later)
18:32 <jschwart> dashcloud: I'd like to free up the physical space as well
18:33 <jschwart> so if I could send things somewhere where they would be useful, I'd rather do that
19:03 -- schbirid2 has joined #archiveteam-bs
19:04 -- schbirid has quit IRC (Read error: Connection reset by peer)
19:59 -- zhongfu has quit IRC (Ping timeout: 260 seconds)
20:05 -- zhongfu has joined #archiveteam-bs
20:51 -- benuski has quit IRC (Read error: Operation timed out)
20:54 -- BlueMaxim has joined #archiveteam-bs
21:07 -- benuski has joined #archiveteam-bs
21:47 -- BartoCH has quit IRC (Quit: WeeChat 1.9.1)
21:52 -- BartoCH has joined #archiveteam-bs
22:12 -- c4rc4s has quit IRC (Quit: words)
22:25 -- c4rc4s has joined #archiveteam-bs
22:32 -- Ceryn has joined #archiveteam-bs
22:34 <Ceryn> Hey. Do you guys archive websites? If yes, what tools do you use? Custom ones? wget/WARC?
22:35 <Ceryn> I'm looking at some available options. It seems I might have to write my own to get the features I want (such as keeping the archive up to date, handling changes in pages, ...).
22:35 <JAA> That's what we do. We mostly use wpull (e.g. ArchiveBot and many people using it manually) or wget-lua (in the warrior) and write WARCs.
22:36 <Ceryn> Hm. Warrior?
22:36 <JAA> Our tool to launch Distributed Preservation of Service attacks
22:36 <JAA> http://archiveteam.org/index.php?title=Warrior
22:36 <Ceryn> Haha.
22:38 <JAA> wpull's quite nice if you need to tweak its behaviour. It's written in Python, i.e. you can monkeypatch everything.
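A minimal sketch of the monkeypatching idea JAA mentions here. `Fetcher` is a hypothetical stand-in class, not wpull's real API (which varies between versions); only the technique itself is illustrated.

```python
# Sketch of monkeypatching. "Fetcher" is a hypothetical stand-in for a
# wpull component; the pattern is the same for any Python class.

class Fetcher:
    def fetch(self, url):
        return f"fetched {url}"

_original_fetch = Fetcher.fetch

def fetch_with_logging(self, url):
    # Inject custom behaviour at runtime, without forking the library.
    print(f"about to fetch: {url}")
    return _original_fetch(self, url)

Fetcher.fetch = fetch_with_logging  # the monkeypatch

print(Fetcher().fetch("http://example.com/"))
```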
22:38 <Ceryn> I'll look into Warrior. Thanks.
22:38 <Ceryn> wpull's a bot command?
22:38 <JAA> wpull's a replacement for wget.
22:39 <JAA> It's mostly a reimplementation of wget actually, but in Python rather than nasty unmaintainable C code.
22:39 <Ceryn> I want to defend C, but I suppose Python *is* the better language in this case.
22:40 <JAA> Nothing wrong with C in general, but from what I've heard, the wget code is really ugly.
22:40 <Ceryn> This one? https://github.com/chfoo/wpull
22:40 <JAA> Yep
22:40 <Ceryn> Cool.
22:40
π
|
DrasticAc |
There are C# solutions that exist https://github.com/antiufo/Shaman.Scraping |
22:41
π
|
JAA |
You'll want to use either version 1.2.3 though or the fork by FalconK. 2.0.1 is really buggy. |
22:41
π
|
Ceryn |
And why do you want WARC files? Solely for Time Machine compatibility? |
22:41
π
|
DrasticAc |
Although it needs work, I had to hack it to get it compiled |
22:41
π
|
JAA |
Wayback Machine* |
22:41
π
|
Ceryn |
Right. |
22:41
π
|
JAA |
Well, WARC's a nice format that saves all the relevant metadata (request and response headers etc.). |
22:42
π
|
DrasticAc |
It's easy enough to playback the warc if you want to scrap the HTML. |
22:42
π
|
DrasticAc |
Or whatever it is your getting |
22:43 <Ceryn> So you want the WARC metadata to be able to re-do scrapes?
22:43 <JAA> Yep. You can run pywb or a similar tool to play back a WARC and browse the site just like it was the live site.
22:43 <JAA> No, because metadata is just as important as the content itself.
22:43 <Ceryn> Hm. Interesting. I was thinking of just storing HTML pages for ease of use, but I'll look into pywb or similar too.
22:44 <JAA> Also, you need at least two pieces of metadata (URL and date) to be able to do anything meaningful with the archive at all.
22:44 <JAA> Or to even call it an archive, in my opinion.
22:45 <JAA> Otherwise it's just a copy (and possibly a modified one to get images etc. to work).
22:45 <Ceryn> Oh? ELI5 why full metadata, to the extent WARC provides, is as important as the data? I can see uses and nice-to-haves, but I don't know about must-haves.
22:45 <DrasticAc> Oh yeah, I'm just saying if you _did_ want to scrape it after the fact to mine it, it's much easier to use the WARC than the live site (which will probably be down)
22:46 <DrasticAc> So getting the WARC first makes total sense in any case
22:46 <DrasticAc> btw, is archive.org down? I can't connect to the S3 and the main website seems down.
22:46 <Ceryn> So WARCs preserve everything in mint condition. Saves me from doing a lot of patchwork myself.
22:46
π
|
|
drumstick has joined #archiveteam-bs |
22:47
π
|
Ceryn |
DrasticAc: archive.org looks like it's timing out here. |
22:48
π
|
JAA |
Yep, down at the moment. THere was some talk about it earlier in #archivebot. |
22:50 <JAA> WARC doesn't actually contain that much metadata. It contains the URL and timestamp, a record type (to distinguish between requests, responses, and other things), the IP address to which a host resolved at that point in time, and hash digests for verifying that the contents aren't corrupted.
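For illustration, a made-up WARC/1.0 response record header carrying the fields JAA lists; the field names are standard, but every value below is invented.

```
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:3e1f0a52-7c0b-4e2e-9d3a-1f2b8c6a4d5e>
WARC-Date: 2017-10-21T22:50:00Z
WARC-Target-URI: http://example.com/
WARC-IP-Address: 93.184.216.34
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
Content-Type: application/http; msgtype=response
Content-Length: 1046
```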
22:52 <JAA> You'll want to preserve HTTP requests and response headers to be able to reconstruct what you actually archived. For example, if a site modifies its pages based on user agent detection and you don't store the HTTP request, you'll never know what triggered that response.
22:52 <Ceryn> DrasticAc: I hesitate to use a C# solution though. Maybe if it does exactly what I want. But otherwise I risk having to write C#.
22:52 <DrasticAc> Yeah, I agree. Wish it was F#.
22:53 <JAA> Response headers contain additional information, e.g. about the server software used or when a resource was last modified.
22:54 <JAA> I would argue that all of this is very important information.
22:54 <Ceryn> I suppose I can't say it's unimportant.
22:55 <Ceryn> My main use case, I think, would be having copies of sites to go back to for nostalgia or something.
22:56 <Ceryn> But maybe WARC's a better way to go about it anyway.
22:56
π
|
JAA |
Yeah, I think they're superior in general. Having everything in a single file (or a few of them) is much easier to handle than thousands of files spread across directories (e.g. if you use wget --mirror). |
22:58
π
|
Ceryn |
Are you supposed to dump all logs for a scrape in a single WARC file? Or size-capped WARC files? |
22:58
π
|
JAA |
They're also compressed, which can be a *massive* space saver. |
23:00
π
|
JAA |
Usually size-capped files of a few GB (I use 2 GiB for my own grabs, ArchiveBot uses 5 GiB by default), though I'm not entirely sure why. It shouldn't really be a problem to have bigger files. |
23:00
π
|
JAA |
You can always split or merge them, too. |
23:00
π
|
Ceryn |
Right. |
23:02
π
|
Ceryn |
If you want to update the archive, do you just WARC from scratch? Or re-visiting links from the WARC data? |
23:02
π
|
Ceryn |
Maybe we're into custom code here. |
23:03
π
|
JAA |
Depends on what you want to do exactly. |
23:03
π
|
odemg |
Doesn't seem like just the site is down, my uploads stopped over torrent and python |
23:03
π
|
Ceryn |
Well, I want to keep my archive up to date while still saving older copies of pages. |
23:04 <JAA> So WARC supports deduplication by referencing old records. You could make use of that.
23:05 <JAA> You can of course also skip any URLs that you know (or assume) haven't changed in the meantime.
23:05 <JAA> On playback, you'd get the old version in that case.
23:05 <JAA> But yes, this would probably require custom code.
23:05 <JAA> The deduplication part, I mean.
23:06 <JAA> Heritrix might support it though.
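A sketch of the deduplication JAA describes, using the warcio library (my choice, not mentioned in the chat) and assuming its revisit-record helper fits; the file name, URL, digest, and date below are all made up.

```python
from warcio.warcwriter import WARCWriter

# Sketch only: append a revisit record instead of storing an unchanged
# payload a second time. All values below are invented for illustration.
with open('dedup.warc.gz', 'ab') as out:
    writer = WARCWriter(out, gzip=True)
    record = writer.create_revisit_record(
        'http://example.com/logo.png',                   # URL of the new capture
        digest='sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',  # payload digest seen again
        refers_to_uri='http://example.com/logo.png',     # earlier capture's URL
        refers_to_date='2017-10-01T00:00:00Z',           # earlier capture's date
    )
    writer.write_record(record)
```

On playback, a revisit record tells the replay tool to serve the payload of the earlier capture it points to, which is exactly the "old version" behaviour described above.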
23:06 <Ceryn> Hm. WARC does sound better and better.
23:07 -- jschwart has quit IRC (Quit: Konversation terminated!)
23:09 <Ceryn> Sweet.
23:10 <Ceryn> Perhaps I might want to use WARC anyway, haha. I think I might still end up writing relevant code, but I can probably base a whole lot on the WARC library and maybe wpull.
23:11 -- kvieta has quit IRC (Quit: greedo shot first)
23:12 <Ceryn> So thanks for that.
23:12 <Ceryn> When you guys archive stuff, is it for your own use/hoard?
23:15 -- kvieta has joined #archiveteam-bs
23:16 <JAA> I'm sure people in here archive stuff for their own use, but most of it goes to the Internet Archive for public consumption.
23:17 <Ceryn> And that works by people just archiving and uploading whatever?
23:17 <JAA> I'm not entirely sure how it works. I still haven't uploaded my grabs to the IA because I'm too lazy to figure out how to do it.
23:17 <Ceryn> Haha okay. That'll be me once it's all up and running.
23:17
π
|
JAA |
I think you need to set an attribute when uploading the files. Not sure if it happens automatically afterwards or not. |
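For reference, the internetarchive Python library handles such uploads; the attribute JAA is presumably thinking of is item metadata such as the mediatype field, though that reading is an assumption, and the identifier, file name, and title below are made up.

```python
from internetarchive import upload

# Sketch of an IA upload; identifier and file name are invented, and
# whether these metadata values are the right ones for WARCs is exactly
# the detail left uncertain in the conversation above.
responses = upload(
    'example-site-grab-2017',
    files=['example-site-2017.warc.gz'],
    metadata={'mediatype': 'web', 'title': 'Example site grab (2017)'},
)
print(responses[0].status_code)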
23:19 <Ceryn> Do you archive everything on the domain for a given domain name? All resources, images, video, objects?
23:19 <JAA> Depends strongly on the site.
23:20 <JAA> Usually, there is some stuff you need to skip.
23:20 <JAA> Also, you'll generally want to retrieve images etc. even if they're not on the same domain.
23:21 <Ceryn> Yes, I thought about that. And maybe the page you're archiving links to other pages you ought to have too?
23:21 <JAA> Yeah, that's what ArchiveBot does by default, retrieving one extra layer of pages "around" the actual target for context.
23:21 <Ceryn> (Maybe just the specific pages linked to, and not the entire thing. So a follow-links number of 2 or something.)
23:22 <Ceryn> Cool.
23:24 <Ceryn> What stuff would you need to skip?
23:24 <JAA> Misparsed stuff (e.g. from wpull's JavaScript "parser", which just extracts anything that looks like a path), infinite loops (e.g. calendars), share links, other useless stuff (e.g. links that require an account), etc.
23:26 <JAA> Sometimes, you get session IDs in the URL, which you also need to handle carefully to not retrieve everything hundreds of times.
23:26 <JAA> Then there's specific stuff like archiving forums, where you might want to skip the links for individual posts depending on how many posts there are.
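A sketch of what ignore patterns for the trouble spots JAA lists might look like as regular expressions; the patterns are illustrative, not a vetted set.

```python
import re

# Illustrative ignore patterns for the cases mentioned above (session
# IDs, calendar traps, share links); invented, not a vetted set.
IGNORE_PATTERNS = [
    re.compile(r'[?&](?:PHPSESSID|jsessionid|sid)=', re.I),
    re.compile(r'/calendar/\d{4}'),
    re.compile(r'[?&]share=(?:facebook|twitter)'),
]

def should_skip(url: str) -> bool:
    """Return True if the URL matches any ignore pattern."""
    return any(p.search(url) for p in IGNORE_PATTERNS)

print(should_skip('http://example.com/?PHPSESSID=abc123'))  # True
```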
23:26
π
|
godane |
so i'm now down to $35 on my patreon: https://www.patreon.com/godane |
23:35
π
|
Ceryn |
Hm. Is manual work usually required for scraping a site, then? |
23:35
π
|
Ceryn |
Apart from starting the scraper. |
23:36
π
|
JAA |
Mostly just adding ignore patterns and fiddling with the concurrency and delay settings. |
23:37
π
|
JAA |
With plain wpull, that's a bit of a pain though. |
23:37
π
|
JAA |
So you might want to look into grab-site. |
23:37
π
|
JAA |
Not sure how easily that is customisable though. |
23:39 -- pizzaiolo has quit IRC (Remote host closed the connection)
23:41 -- pizzaiolo has joined #archiveteam-bs
23:41 <Ceryn> Huh. That's a lot of ignore patterns grab-site has.
23:42 <JAA> Yup
23:43 <Ceryn> Ignore patterns seem to be a pain in the ass.
23:44 <Ceryn> By the time you realise you need an ignore pattern, surely you'll have accumulated a ton of crap?
23:47 <JAA> That's usually how it goes, yes.
23:48 <Ceryn> And then... You start over? Because you don't want all that crap?
23:48 <Ceryn> Or maybe you can just prune it?
23:48 <JAA> Depends on your goal. We usually just leave it.
23:49 <JAA> But yeah, you could filter the WARC.
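A sketch of that kind of after-the-fact filtering with the warcio library (my choice of tool, not named in the chat), dropping records whose target URI matches an unwanted pattern; the file names and the pattern are invented.

```python
import re

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

JUNK = re.compile(r'[?&]share=')  # invented example of crap to prune

# Copy every record except those whose target URI matches the pattern.
with open('original.warc.gz', 'rb') as src, open('pruned.warc.gz', 'wb') as dst:
    writer = WARCWriter(dst, gzip=True)
    for record in ArchiveIterator(src):
        uri = record.rec_headers.get_header('WARC-Target-URI') or ''
        if JUNK.search(uri):
            continue
        writer.write_record(record)
```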
23:50 <Ceryn> Shit. That's going to be a nightmare.
23:50 <godane> so archive.org is not working for me
23:50 <Ceryn> How do these bad patterns work? Infinite-length URLs? Or circular links?
23:50 <JAA> godane: Yep, they've been down for at least two hours.
23:50 <JAA> No word yet on what's going on.
23:52 <JAA> Ceryn: Basically, the stuff I listed above. Circular links are already handled by wpull. Infinite loops need to be handled manually in most cases.
23:53 <Ceryn> Oh.
23:53 <godane> ok then
23:54 <Ceryn> Maybe a recursion depth limit could help too.
23:54 <JAA> Yeah, depends on the site really.
23:54 <Ceryn> Tough to figure out how far it should go though.
23:54 <Ceryn> Yeah. I had hoped I could automate this pretty much fully.
23:54
π
|
dashcloud |
as you've probably noticed, site-grabbing is more of an art than a science, but the more you do it, the better you can get at it |
23:55
π
|
JAA |
Yep, this. |
23:55
π
|
Ceryn |
Heh, yeah. |
23:55
π
|
JAA |
And if you just want to regrab the same site(s) periodically, you can probably automate most of it. |
23:55
π
|
Ceryn |
I suppose it ought to only be painful the first time. |
23:55
π
|
Ceryn |
There's a joke in here somewhere. |
23:55
π
|
dashcloud |
you're facing much the same problem browser vendors do- they have to support tons of crazy ideas, and things that should never have seen the light of day |
23:56
π
|
JAA |
<marquee>I'm not sure what you're talking about.</marquee> |
23:57 <dashcloud> you didn't like the marquee + MIDI music aesthetic of the 90s?
23:57 <Frogging> it's better than a lot of what we have now tbh
23:57 <JAA> Yeah, that's probably true.
23:58 <DrasticAc> I use marquee whenever possible
23:58 <Ceryn> Haha. I had forgotten (repressed?) that effect.
23:58 <DrasticAc> It can still be used in all major browsers, even though I think they'll complain in the console if you do