#archiveteam-bs 2017-10-29,Sun


Time Nickname Message
00:50 πŸ”— Honno has joined #archiveteam-bs
01:17 πŸ”— Mateon1 has quit IRC (Read error: Operation timed out)
01:30 πŸ”— drumstick has quit IRC (Ping timeout: 360 seconds)
01:30 πŸ”— drumstick has joined #archiveteam-bs
02:00 πŸ”— Honno has quit IRC (Read error: Operation timed out)
02:16 πŸ”— hook54321 I've been recording a camera feed from a bar in Catalonia for hours... But I'm still on the fence about whether I should upload it to archive.org or not.
02:24 πŸ”— schbirid has quit IRC (Ping timeout: 255 seconds)
02:36 πŸ”— schbirid has joined #archiveteam-bs
02:37 πŸ”— pizzaiolo has joined #archiveteam-bs
03:30 πŸ”— godane so this guy has an interesting set of vhs tapes : https://www.ebay.com/sch/VHS-Video-Tapes/149960/m.html?_nkw=&_armrs=1&_ipg=&_from=&_ssn=froggyholler
03:30 πŸ”— godane i'm only interested in one tape that said Dog Day Afternoon with time of 1h 57m
03:30 πŸ”— godane on TB
03:31 πŸ”— godane *TBS
03:35 πŸ”— hook54321 I wonder what would happen if we set up something like a GoFundMe so we could obtain stuff like that
03:43 πŸ”— godane i do have a patreon page: https://www.patreon.com/godane
03:43 πŸ”— godane hook54321: its how i got the last set of tapes from ebay
03:55 πŸ”— godane SketchCow: i'm uploading the original broadcast of Dinotopia so don't upload my vhs stuff for the next day
03:56 πŸ”— godane it's about 17,272,156,160 bytes in size
04:19 πŸ”— qw3rty113 has joined #archiveteam-bs
04:22 πŸ”— Stilett0 has joined #archiveteam-bs
04:25 πŸ”— qw3rty112 has quit IRC (Read error: Operation timed out)
04:32 πŸ”— Mateon1 has joined #archiveteam-bs
04:33 πŸ”— phillipsj I saw a few episodes of that.. it was OK, but not great.
04:33 πŸ”— * phillipsj is getting old.
04:58 πŸ”— drumstick has quit IRC (Ping timeout: 255 seconds)
04:58 πŸ”— drumstick has joined #archiveteam-bs
06:20 πŸ”— pizzaiolo has quit IRC (Ping timeout: 246 seconds)
06:41 πŸ”— SketchCow Great
08:51 πŸ”— hook54321 I think I've found some Catalonia radio stations that can be listened to online, however I've been blocked from them (I'm pretty sure for trying different ports), so I can't record them. :/
09:39 πŸ”— mls has joined #archiveteam-bs
09:42 πŸ”— mls has left
09:45 πŸ”— hook54321 https://www.youtube.com/watch?v=cfmiNyneO88
09:53 πŸ”— BlueMaxim has quit IRC (Quit: Leaving)
10:15 πŸ”— JAA SketchCow: Is there anything that can be done to speed up transfers to FOS? My ArchiveBot pipeline has trouble keeping up as the uploads average only 2-3 MB/s. Or should I look into uploading to IA directly instead?
10:16 πŸ”— Honno has joined #archiveteam-bs
10:44 πŸ”— hook54321 If anyone is interested in joining a discord server where there are people from Catalonia, dm me and, if I'm awake, I'll probably send you the link. I might ask you a few questions about why you want to join it.
10:50 πŸ”— Aerochrom has joined #archiveteam-bs
12:25 πŸ”— nepeat has quit IRC (ZNC 1.6.5 - http://znc.in)
12:50 πŸ”— will has joined #archiveteam-bs
13:45 πŸ”— pizzaiolo has joined #archiveteam-bs
14:19 πŸ”— Aoede has quit IRC (Ping timeout: 255 seconds)
14:20 πŸ”— Aoede has joined #archiveteam-bs
14:20 πŸ”— Aoede has quit IRC (Connection closed)
14:20 πŸ”— Aoede has joined #archiveteam-bs
14:25 πŸ”— fie has quit IRC (Quit: Leaving)
14:40 πŸ”— drumstick has quit IRC (Read error: Operation timed out)
15:13 πŸ”— pizzaiolo has quit IRC (Remote host closed the connection)
15:48 πŸ”— SketchCow JAA: Noo
15:48 πŸ”— SketchCow JAA: Also Noo
15:49 πŸ”— SketchCow FOS is slow, is about to be replaced by another FOS
15:54 πŸ”— JAA Ok, sweet.
15:54 πŸ”— JAA Any idea when that'll happen?
16:04 πŸ”— SketchCow Soon?
16:20 πŸ”— julius_ is now known as jschwart
16:21 πŸ”— jschwart I've made a collection of discs now that I can probably send to the internet archive, are there any instructions online for that?
16:22 πŸ”— jschwart another thing I have is old hardware with manuals, discs, etc. does anybody have any suggestion on what would be nice to do with that?
16:22 πŸ”— jschwart and what about software discs which came with relatively big books as manuals?
16:24 πŸ”— SketchCow Describe discs in this context.
16:26 πŸ”— jschwart I have for instance Klik & Play, an old tool to make games, it came with quite a big book
16:26 πŸ”— jschwart also Superlogo which came with a (Dutch) book and discs
17:06 πŸ”— pizzaiolo has joined #archiveteam-bs
17:31 πŸ”— Specular has joined #archiveteam-bs
17:32 πŸ”— Specular asking again in case someone here knows: is it possible to save a full list of results for a generic 'site:' query from Google?
17:33 πŸ”— dashcloud Specular: not completely automated- Google quickly hits you with captchas and bans if you try to scrape them
17:33 πŸ”— Specular god damn it
17:33 πŸ”— dashcloud but there's a number of tools you can use in browser
17:33 πŸ”— Specular a site recently hid an entire section of its siteβ€”according to one user around 36 million postsβ€”and I'm not sure for how long Google will retain the results of its previous crawls
17:34 πŸ”— dashcloud apparently the thing you want to do is very popular in the SEO community, so there's a number of extensions for this, and guides on doing it
17:35 πŸ”— Specular dashcloud, what should I be entering when searching? Couldn't find much when I tried last
17:35 πŸ”— Specular (that is to find more about how to do this)
17:35 πŸ”— dashcloud you probably want download google search results to csv
17:36 πŸ”— Specular for an enormous scrape wouldn't the size become unmanageably large to open?
17:38 πŸ”— Specular I'm not actually sure if this would cause problems but thought I'd ask
17:38 πŸ”— dashcloud possibly- then you'd use a different tool to open the file
17:38 πŸ”— dashcloud notepad++ , a database, or possibly LibreOffice/Excel
17:39 πŸ”— dashcloud there's this Chrome extension which may or may not work: https://chrome.google.com/webstore/detail/linkclump/lfpjkncokllnfokkgpkobnkbkmelfefj?hl=en
17:40 πŸ”— dashcloud here's a promising firefox extension: https://jurnsearch.wordpress.com/2012/01/27/how-to-extract-google-search-results-with-url-title-and-snippet-in-a-csv-file/
17:41 πŸ”— dashcloud jschwart: if you want to upload stuff yourself to the Internet Archive, I can give you some basic guidelines
17:45 πŸ”— dashcloud if you'd prefer to send stuff to IA, ask SketchCow for the mailing address
17:49 πŸ”— Specular Google seems like it might offer a way to grab this officially via a service called the 'Search Console', but to get over a certain number of results one needs the pro version. Perhaps someone on the aforementioned site has a subscription.
18:00 πŸ”— benuski has joined #archiveteam-bs
18:25 πŸ”— Specular has quit IRC (be back later)
18:32 πŸ”— jschwart dashcloud: I'd like to free up the physical space as well
18:33 πŸ”— jschwart so if I could send things somewhere where they would be useful, I'd rather do that
19:03 πŸ”— schbirid2 has joined #archiveteam-bs
19:04 πŸ”— schbirid has quit IRC (Read error: Connection reset by peer)
19:59 πŸ”— zhongfu has quit IRC (Ping timeout: 260 seconds)
20:05 πŸ”— zhongfu has joined #archiveteam-bs
20:51 πŸ”— benuski has quit IRC (Read error: Operation timed out)
20:54 πŸ”— BlueMaxim has joined #archiveteam-bs
21:07 πŸ”— benuski has joined #archiveteam-bs
21:47 πŸ”— BartoCH has quit IRC (Quit: WeeChat 1.9.1)
21:52 πŸ”— BartoCH has joined #archiveteam-bs
22:12 πŸ”— c4rc4s has quit IRC (Quit: words)
22:25 πŸ”— c4rc4s has joined #archiveteam-bs
22:32 πŸ”— Ceryn has joined #archiveteam-bs
22:34 πŸ”— Ceryn Hey. Do you guys archive websites? If yes, what tools do you use? Custom ones? wget/warc?
22:35 πŸ”— Ceryn I'm looking at some available options. It seems I might have to write my own to get the features I want (such as keeping the archive up to date, handling changes in pages, ...).
22:35 πŸ”— JAA That's what we do. We mostly use wpull (e.g. ArchiveBot and many people using it manually) or wget-lua (in the warrior) and write WARCs.
22:36 πŸ”— Ceryn Hm. warrior?
22:36 πŸ”— JAA Our tool to launch Distributed Preservation of Service attacks
22:36 πŸ”— JAA http://archiveteam.org/index.php?title=Warrior
22:36 πŸ”— Ceryn Haha.
22:38 πŸ”— JAA wpull's quite nice if you need to tweak its behaviour. It's written in Python, i.e. you can monkeypatch everything.
22:38 πŸ”— Ceryn I'll look into Warrior. Thanks.
22:38 πŸ”— Ceryn wpull's a bot command?
22:38 πŸ”— JAA wpull's a replacement for wget.
22:38 πŸ”— JAA It's mostly a reimplementation of wget actually, but in Python rather than nasty unmaintainable C code.
22:39 πŸ”— Ceryn I want to defend C, but I suppose Python *is* the better language in this case.
22:39 πŸ”— JAA Nothing wrong with C in general, but from what I've heard, the wget code is really ugly.
22:40 πŸ”— Ceryn This one? https://github.com/chfoo/wpull
22:40 πŸ”— JAA Yep
22:40 πŸ”— Ceryn Cool.
22:40 πŸ”— DrasticAc There are C# solutions that exist https://github.com/antiufo/Shaman.Scraping
22:41 πŸ”— JAA You'll want to use either version 1.2.3 though or the fork by FalconK. 2.0.1 is really buggy.
22:41 πŸ”— Ceryn And why do you want WARC files? Solely for Time Machine compatibility?
22:41 πŸ”— DrasticAc Although it needs work, I had to hack it to get it compiled
22:41 πŸ”— JAA Wayback Machine*
22:41 πŸ”— Ceryn Right.
22:41 πŸ”— JAA Well, WARC's a nice format that saves all the relevant metadata (request and response headers etc.).
22:42 πŸ”— DrasticAc It's easy enough to playback the warc if you want to scrape the HTML.
22:42 πŸ”— DrasticAc Or whatever it is you're getting
22:43 πŸ”— Ceryn So you want the WARC metadata to be able to re-do scrapes?
22:43 πŸ”— JAA Yep. You can run pywb or a similar tool to playback a WARC and browse the site just like it was the live site.
22:43 πŸ”— JAA No, because metadata is just as important as the content itself.
22:43 πŸ”— Ceryn Hm. Interesting. I was thinking of just storing HTML pages for ease of use, but I'll look into pywb or similar too.
22:43 πŸ”— JAA Also, you need at least two pieces of metadata (URL and date) to be able to do anything meaningful with the archive at all.
22:44 πŸ”— JAA Or to even call it an archive, in my opinion.
22:44 πŸ”— JAA Otherwise it's just a copy (and possibly a modified one to get images etc. to work).
22:45 πŸ”— Ceryn Oh? ELI5 why full meta data, to the extent WARC provides, is as important as the data? I can see uses and nice-to-haves, but I don't know about must-haves.
22:45 πŸ”— DrasticAc Oh yeah, I'm just saying if you _did_ want to scrape it after the fact to mine it, it's much easier to use the WARC than the live site (which will probably be down)
22:45 πŸ”— DrasticAc So getting the warc first makes total sense in any case
22:46 πŸ”— DrasticAc btw, is archive.org down? I can't connect to the S3 and the main website seems down.
22:46 πŸ”— Ceryn So WARCs preserve everything in mint condition. Saves me from doing a lot of patchwork myself.
22:46 πŸ”— drumstick has joined #archiveteam-bs
22:47 πŸ”— Ceryn DrasticAc: archive.org looks like it's timing out here.
22:48 πŸ”— JAA Yep, down at the moment. There was some talk about it earlier in #archivebot.
22:50 πŸ”— JAA WARC doesn't actually contain that much metadata. It contains the URL and timestamp, a record type (to distinguish between requests, responses, and other things), the IP address to which a host resolved at that point in time, and hash digests for verifying that the contents aren't corrupted.
22:52 πŸ”— JAA You'll want to preserve HTTP requests and response headers to be able to reconstruct what you actually archived. For example, if a site modifies its pages based on user agent detection and you don't store the HTTP request, you'll never know what triggered that response.
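[A minimal stdlib-only sketch of the record structure JAA describes above: a WARC/1.0 response record carrying the URL, timestamp, record type, resolved IP, and a payload digest. This is illustrative only; real WARC files use a base32-encoded SHA-1 digest and a WARC-Record-ID (omitted here), and tools like wpull write the records for you.]

```python
# Illustrative sketch of a WARC/1.0 response record, stdlib only.
# Real WARCs also carry WARC-Record-ID and base32 digests; omitted here.
import hashlib
from datetime import datetime, timezone

def warc_response_record(url, ip, http_bytes):
    # Digest of the raw HTTP bytes lets readers verify integrity later.
    digest = hashlib.sha1(http_bytes).hexdigest()
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-IP-Address: {ip}",
        f"WARC-Payload-Digest: sha1:{digest}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    # Record = header block, blank line, raw HTTP bytes, trailing blank lines.
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
record = warc_response_record("http://example.com/", "93.184.216.34", http)
```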
22:52 πŸ”— Ceryn DrasticAc: I hesitate to use a C# solution though. Maybe if it does exactly what I want. But otherwise I risk having to write C#.
22:52 πŸ”— DrasticAc Yeah, I agree. Wish it were F#.
22:52 πŸ”— JAA Response headers contain additional information, e.g. about the server software used or when a resource was last modified.
22:53 πŸ”— JAA I would argue that all of this is very important information.
22:54 πŸ”— Ceryn I suppose I can't say it's non-important.
22:54 πŸ”— Ceryn My main use case, I think, would be having copies of sites to go back to for nostalgia or something.
22:55 πŸ”— Ceryn But maybe WARCs' a better way to go about it anyway.
22:56 πŸ”— JAA Yeah, I think they're superior in general. Having everything in a single file (or a few of them) is much easier to handle than thousands of files spread across directories (e.g. if you use wget --mirror).
22:58 πŸ”— Ceryn Are you supposed to dump all logs for a scrape in a single WARC file? Or size-capped WARC files?
22:58 πŸ”— JAA They're also compressed, which can be a *massive* space saver.
23:00 πŸ”— JAA Usually size-capped files of a few GB (I use 2 GiB for my own grabs, ArchiveBot uses 5 GiB by default), though I'm not entirely sure why. It shouldn't really be a problem to have bigger files.
23:00 πŸ”— JAA You can always split or merge them, too.
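[A hypothetical sketch of the size-capping JAA mentions: roll over to a new numbered .warc.gz file once the current one would exceed a cap (2 GiB in his own grabs, 5 GiB in ArchiveBot). The naming scheme is made up for illustration; real tools handle splitting themselves.]

```python
# Sketch: write records into numbered files, starting a new file
# whenever the current one would exceed the size cap.
# Filenames here are hypothetical, not what wpull/ArchiveBot produce.
import os

class SizeCappedWriter:
    def __init__(self, prefix, cap=2 * 1024**3):  # default 2 GiB
        self.prefix, self.cap, self.serial = prefix, cap, 0
        self.fh = None

    def _open_next(self):
        if self.fh:
            self.fh.close()
        self.serial += 1
        self.fh = open(f"{self.prefix}-{self.serial:05d}.warc.gz", "ab")

    def write(self, record_bytes):
        # Start a new file if this record would push us past the cap.
        if self.fh is None or self.fh.tell() + len(record_bytes) > self.cap:
            self._open_next()
        self.fh.write(record_bytes)
```

A record is never split across files, which is why caps are soft limits rather than exact sizes.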
23:00 πŸ”— Ceryn Right.
23:02 πŸ”— Ceryn If you want to update the archive, do you just WARC from scratch? Or re-visiting links from the WARC data?
23:02 πŸ”— Ceryn Maybe we're into custom code here.
23:03 πŸ”— JAA Depends on what you want to do exactly.
23:03 πŸ”— odemg Doesn't seem like just the site is down, my uploads stopped over torrent and python
23:03 πŸ”— Ceryn Well, I want to keep my archive up to date while still saving older copies of pages.
23:04 πŸ”— JAA So WARC supports deduplication by referencing old records. You could make use of that.
23:04 πŸ”— JAA You can of course also skip any URLs that you know (or assume) haven't changed in the meantime.
23:05 πŸ”— JAA On playback, you'd get the old version in that case.
23:05 πŸ”— JAA But yes, this would probably require custom code.
23:05 πŸ”— JAA The deduplication part, I mean.
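[A sketch of the deduplication idea JAA describes: keep a digest of each payload already stored, and record later identical captures as lightweight "revisit" references to the first one instead of storing the body again. The function and dict returned here are illustrative, not a real WARC writer API.]

```python
# Sketch of WARC-style dedup: store each payload once; identical later
# captures become "revisit" references to the first capture.
# The return values are illustrative stand-ins for real WARC records.
import hashlib

seen = {}  # payload sha1 -> (url, date) of first capture

def dedup(url, date, payload):
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen:
        ref_url, ref_date = seen[digest]
        # In a real WARC this would be a "revisit" record pointing
        # at the original response record.
        return {"type": "revisit", "refers-to": ref_url,
                "refers-to-date": ref_date}
    seen[digest] = (url, date)
    return {"type": "response", "payload": payload}
```

On playback, a revisit record tells the reader to serve the body from the earlier capture, which is how an updated crawl can stay small while old versions remain reachable.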
23:05 πŸ”— JAA Heritrix might support it though.
23:06 πŸ”— Ceryn Hm. WARC does sound better and better.
23:06 πŸ”— jschwart has quit IRC (Quit: Konversation terminated!)
23:07 πŸ”— Ceryn Sweet.
23:09 πŸ”— Ceryn Perhaps I might want to use WARC anyway, haha. I think I might still end up writing relevant code, but I can probably base a whole lot on the WARC library and maybe wpull.
23:10 πŸ”— kvieta has quit IRC (Quit: greedo shot first)
23:11 πŸ”— Ceryn So thanks for that.
23:12 πŸ”— Ceryn When you guys archive stuff, is it for your own use/hoard?
23:12 πŸ”— kvieta has joined #archiveteam-bs
23:15 πŸ”— JAA I'm sure people in here archive stuff for their own use, but most of it goes to the Internet Archive for public consumption.
23:16 πŸ”— Ceryn And that works by people just archiving and uploading whatever?
23:17 πŸ”— JAA I'm not entirely sure how it works. I still haven't uploaded my grabs to the IA because I'm too lazy to figure out how to do it.
23:17 πŸ”— Ceryn Haha okay. That'll be me once it's all up and running.
23:17 πŸ”— JAA I think you need to set an attribute when uploading the files. Not sure if it happens automatically afterwards or not.
23:19 πŸ”— Ceryn Do you archive everything on the domain for a given domain name? All resources, images, video, objects?
23:19 πŸ”— JAA Depends strongly on the site.
23:19 πŸ”— JAA Usually, there is some stuff you need to skip.
23:20 πŸ”— JAA Also, you'll generally want to retrieve images etc. also if they're not on the same domain.
23:20 πŸ”— Ceryn Yes, I thought about that. And maybe the page you're archiving links to other pages you ought to have too?
23:21 πŸ”— JAA Yeah, that's what ArchiveBot does by default, retrieving one extra layer of pages "around" the actual target for context.
23:21 πŸ”— Ceryn (Maybe just the specific pages linked to, and not the entire thing. So a follow-links number of 2 or something.)
23:21 πŸ”— Ceryn Cool.
23:22 πŸ”— Ceryn What stuff would you need to skip?
23:24 πŸ”— JAA Misparsed stuff (e.g. from wpull's JavaScript "parser", which just extracts anything that looks like a path), infinite loops (e.g. calendars), share links, other useless stuff (e.g. links that require an account), etc.
23:24 πŸ”— JAA Sometimes, you get session IDs in the URL, which you also need to handle carefully to not retrieve everything hundreds of times.
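[A sketch of handling the session-ID problem JAA mentions: normalize URLs by stripping session parameters before comparing them, so the same page isn't fetched once per session. The parameter names below are common examples, not an exhaustive or site-specific list.]

```python
# Sketch: strip common session-ID query parameters before comparing
# URLs, so one page with rotating session IDs isn't crawled repeatedly.
# SESSION_PARAMS lists typical examples only; real sites vary.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}

def normalize(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```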
23:26 πŸ”— JAA Then there's specific stuff like archiving forums where you might want to skip the links for individual posts depending on how many posts there are.
23:26 πŸ”— godane so i'm now down to $35 on my patreon: https://www.patreon.com/godane
23:35 πŸ”— Ceryn Hm. Is manual work usually required for scraping a site, then?
23:35 πŸ”— Ceryn Apart from starting the scraper.
23:36 πŸ”— JAA Mostly just adding ignore patterns and fiddling with the concurrency and delay settings.
23:37 πŸ”— JAA With plain wpull, that's a bit of a pain though.
23:37 πŸ”— JAA So you might want to look into grab-site.
23:37 πŸ”— JAA Not sure how easily that is customisable though.
23:39 πŸ”— pizzaiolo has quit IRC (Remote host closed the connection)
23:41 πŸ”— pizzaiolo has joined #archiveteam-bs
23:41 πŸ”— Ceryn Huh. That's a lot of ignore patterns grab-site has.
23:42 πŸ”— JAA Yup
23:43 πŸ”— Ceryn Ignore patterns seem to be a pain in the ass.
23:44 πŸ”— Ceryn By the time you realise you need an ignore pattern, surely you'll have accumulated a ton of crap?
23:47 πŸ”— JAA That's usually how it goes, yes.
23:48 πŸ”— Ceryn And then... You start over? Because you don't want all that crap?
23:48 πŸ”— Ceryn Or maybe you can just prune it?
23:48 πŸ”— JAA Depends on your goal. We usually just leave it.
23:49 πŸ”— JAA But yeah, you could filter the WARC.
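[A sketch of the kind of filtering being discussed: apply grab-site-style regex ignore patterns to captured URLs to decide what to drop. The patterns below are illustrative examples of the categories JAA listed (calendar loops, share links), not grab-site's actual ignore sets.]

```python
# Sketch: grab-site-style regex ignores applied to a URL list,
# e.g. to decide which records to keep when pruning a WARC.
# Patterns are illustrative, not grab-site's real ignore sets.
import re

IGNORES = [
    r"/calendar/\d{4}/\d{2}",  # infinite calendar loops
    r"[?&]share=",             # share links
    r"[?&]replytocom=",        # per-comment reply links
]
compiled = [re.compile(p) for p in IGNORES]

def keep(url):
    return not any(rx.search(url) for rx in compiled)

urls = [
    "http://example.com/post/1",
    "http://example.com/calendar/2017/10",
    "http://example.com/post/1?share=twitter",
]
kept = [u for u in urls if keep(u)]
```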
23:50 πŸ”— Ceryn Shit. That's going to be a nightmare.
23:50 πŸ”— godane so archive.org is not working for me
23:50 πŸ”— Ceryn How do these bad patterns work? Infinite length urls? Or circular links?
23:50 πŸ”— JAA godane: Yep, they've been down for at least two hours.
23:50 πŸ”— JAA No word yet on what's going on.
23:52 πŸ”— JAA Ceryn: Basically, the stuff I listed above. Circular links are already handled by wpull. Infinite loops need to be handled manually in most cases.
23:53 πŸ”— Ceryn Oh.
23:53 πŸ”— godane ok then
23:54 πŸ”— Ceryn Maybe a recursion depth could help too.
23:54 πŸ”— JAA Yeah, depends on the site really.
23:54 πŸ”— Ceryn Tough to figure out how far it should go though.
23:54 πŸ”— Ceryn Yeah. I had hoped I could automate this pretty much fully.
23:54 πŸ”— dashcloud as you've probably noticed, site-grabbing is more of an art than a science, but the more you do it, the better you can get at it
23:55 πŸ”— JAA Yep, this.
23:55 πŸ”— Ceryn Heh, yeah.
23:55 πŸ”— JAA And if you just want to regrab the same site(s) periodically, you can probably automate most of it.
23:55 πŸ”— Ceryn I suppose it ought to only be painful the first time.
23:55 πŸ”— Ceryn There's a joke in here somewhere.
23:55 πŸ”— dashcloud you're facing much the same problem browser vendors do- they have to support tons of crazy ideas, and things that should never have seen the light of day
23:56 πŸ”— JAA <marquee>I'm not sure what you're talking about.</marquee>
23:57 πŸ”— dashcloud you didn't like the marquee + midi music aesthetic of the 90s?
23:57 πŸ”— Frogging it's better than a lot of what we have now tbh
23:57 πŸ”— JAA Yeah, that's probably true.
23:57 πŸ”— DrasticAc I use marquee whenever possible
23:58 πŸ”— Ceryn Haha. I had forgotten (repressed?) that effect.
23:58 πŸ”— DrasticAc It can still be used by all major browsers, even though I think they'll complain in the console if you do
