#archiveteam-bs 2018-08-27,Mon

Time Nickname Message
00:36 πŸ”— BlueMax has joined #archiveteam-bs
01:04 πŸ”— bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
01:24 πŸ”— omglolbah has quit IRC (Read error: Operation timed out)
02:43 πŸ”— bithippo has joined #archiveteam-bs
02:52 πŸ”— bithippo has quit IRC (Read error: Connection reset by peer)
03:09 πŸ”— Flashfire can anyone tell me this about archivebot viewer
03:09 πŸ”— Flashfire <Flashfire> why are some jobs given a url identifier as well and some arent
03:09 πŸ”— Flashfire 13:07 <Flashfire> for reference 10rtx is given a url
03:09 πŸ”— Flashfire 13:07 <Flashfire> but 10t98 isnt
03:52 πŸ”— joepie91 has quit IRC (Read error: Operation timed out)
03:52 πŸ”— Frogging has quit IRC (Read error: Operation timed out)
03:54 πŸ”— c4rc4s has quit IRC (Adios)
03:54 πŸ”— Frogging has joined #archiveteam-bs
03:55 πŸ”— joepie91 has joined #archiveteam-bs
03:55 πŸ”— davidar has joined #archiveteam-bs
03:57 πŸ”— jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
03:57 πŸ”— JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
03:57 πŸ”— wabu has quit IRC (hub.efnet.us irc.colosolutions.net)
03:57 πŸ”— Petri152 has quit IRC (hub.efnet.us irc.colosolutions.net)
03:57 πŸ”— zyphlar has quit IRC (hub.efnet.us irc.colosolutions.net)
03:58 πŸ”— s4y has joined #archiveteam-bs
03:58 πŸ”— * s4y JAA et. al: Re.
03:59 πŸ”— s4y Whoops, didn't mean to hit enter just yet.
03:59 πŸ”— Flashfire lol
03:59 πŸ”— odemg has quit IRC (Ping timeout: 260 seconds)
03:59 πŸ”— s4y JAA et. al: Re. URLTeam, should I pick another project if there's so little work? I like leaving it on auto, but if that's not the best option right now, I'll do something else.
04:00 πŸ”— s4y s/do something else/manually pick a different project for my warrior to work on/
04:11 πŸ”— odemg has joined #archiveteam-bs
04:17 πŸ”— s4y Whelp, I think #newsgrabber is out? It failed to install itself and their IRC topic says that the warrior is no longer supported.
04:17 πŸ”— s4y But it's still listed in my warrior and on the wiki?
04:17 πŸ”— * s4y shrugs
04:19 πŸ”— s4y Neither FTP-GOV nor WikiTeam have any work. I'm starting to question everything.
04:31 πŸ”— Mateon1 has quit IRC (se.hub irc.underworld.no)
04:31 πŸ”— Aoede has quit IRC (se.hub irc.underworld.no)
04:31 πŸ”— i0npulse has quit IRC (se.hub irc.underworld.no)
04:31 πŸ”— kiskaBak has quit IRC (se.hub irc.underworld.no)
04:31 πŸ”— Flashfire has quit IRC (se.hub irc.underworld.no)
04:31 πŸ”— hook54321 has quit IRC (se.hub irc.underworld.no)
04:47 πŸ”— Mateon1 has joined #archiveteam-bs
04:48 πŸ”— ReimuHaku has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
05:06 πŸ”— ReimuHaku has joined #archiveteam-bs
05:08 πŸ”— c4rc4s has joined #archiveteam-bs
05:08 πŸ”— zyphlar has joined #archiveteam-bs
05:08 πŸ”— JAA has joined #archiveteam-bs
05:08 πŸ”— swebb sets mode: +o JAA
05:08 πŸ”— bakJAA sets mode: +o JAA
05:12 πŸ”— kiskabak2 has joined #archiveteam-bs
05:12 πŸ”— jspiros has joined #archiveteam-bs
05:14 πŸ”— Darkstar has quit IRC (Read error: Operation timed out)
05:14 πŸ”— Darkstar has joined #archiveteam-bs
05:16 πŸ”— m007a83_ has joined #archiveteam-bs
05:16 πŸ”— Atom-- has joined #archiveteam-bs
05:20 πŸ”— Atom__ has quit IRC (Read error: Operation timed out)
05:20 πŸ”— BlueMax has quit IRC (Read error: Operation timed out)
05:21 πŸ”— Fredgido_ has quit IRC (Read error: Operation timed out)
05:24 πŸ”— m007a83 has quit IRC (Read error: Operation timed out)
05:24 πŸ”— m007a83_ is now known as m007a83
05:42 πŸ”— faolingfa I am considering setting up an archivebot pipeline to get over the "nothing to contribute" issue described above. The install instructions seem a bit "manual" but if I can shove them in a Docker image, I imagine it might be smooth enough?
05:47 πŸ”— Flashfire has joined #archiveteam-bs
05:48 πŸ”— kiskaBak has joined #archiveteam-bs
05:48 πŸ”— i0npulse has joined #archiveteam-bs
05:57 πŸ”— hook54321 has joined #archiveteam-bs
06:01 πŸ”— BlueMax has joined #archiveteam-bs
06:21 πŸ”— bsmith093 has quit IRC (Leaving.)
06:24 πŸ”— bsmith093 has joined #archiveteam-bs
06:37 πŸ”— caff has quit IRC (Read error: Operation timed out)
07:25 πŸ”— schbirid has joined #archiveteam-bs
08:43 πŸ”— faolingfa Okay, maybe not too straightforward to set that up. I will just wait for better days.
09:10 πŸ”— omglolbah has joined #archiveteam-bs
09:29 πŸ”— caff has joined #archiveteam-bs
09:30 πŸ”— caff has quit IRC (Read error: Connection reset by peer)
09:32 πŸ”— omglolbah has quit IRC (Ping timeout: 260 seconds)
09:44 πŸ”— omglolbah has joined #archiveteam-bs
09:55 πŸ”— Mateon1 has quit IRC (Ping timeout: 268 seconds)
09:55 πŸ”— Mateon1 has joined #archiveteam-bs
11:12 πŸ”— JAA Flashfire: The ArchiveBot viewer is semi-broken. It doesn't list all jobs and/or files. There's an issue on the ArchiveBot GitHub repository with details. I wrote an alternative, which is located at https://github.com/JustAnotherArchivist/archivebot-archives . This works, but it has much fewer features than the viewer (and no proper interface); you can search for the first 5 characters of a job ID or
11:12 πŸ”— JAA the domain to find jobs (though domains don't always work for some reason on the GitHub interface).
11:13 πŸ”— Flashfire I prefer the viewer it just annoys me that I can’t always see if an exact url
11:13 πŸ”— Flashfire has been grabbed
11:13 πŸ”— JAA s4y: Yeah, I don't think there's any active project working in the warrior VM aside from URLTeam at the moment. As mentioned yesterday, new projects are in the pipeline, but they're not quite ready yet.
11:13 πŸ”— Flashfire But hey first word problems
11:14 πŸ”— JAA Ah yes, full URLs have never been searchable in the viewer IIRC, only the domain name.
11:14 πŸ”— JAA And yeah, the viewer is definitely better than my crappy git repository. But it doesn't work correctly, so...
11:15 πŸ”— Flashfire Yeah I know only the domain name but sub items list the exact URL somtiems
11:16 πŸ”— Flashfire 13:07 <Flashfire> why are some jobs given a url identifier as well and some arent
11:16 πŸ”— Flashfire 13:07 <Flashfire> for reference 10rtx is given a url
11:16 πŸ”— Flashfire 13:07 <Flashfire> but 10t98 isnt
11:16 πŸ”— Flashfire JAA for reference
11:17 πŸ”— JAA Yeah, I know.
11:17 πŸ”— JAA Just another bug in the viewer.
11:18 πŸ”— JAA For the record, my alternative doesn't know about the full URL at all. It doesn't download the JSON files; it's merely an index of all files in the ArchiveBot IA collection.
11:21 πŸ”— JAA faolingfa: ArchiveBot pipelines are a bit tricky to set up. Also, we generally only accept pipelines from people who have been around for a while since an ArchiveBot pipeline is a very long-term commitment (the pipeline has to be up 24/7 for months, typically).
11:21 πŸ”— JAA Someone has been experimenting with shoving ArchiveBot into Docker a few months ago, and it didn't work well. We never got around to debug it though.
11:25 πŸ”— chr1sm You got a link to the start of their work re: dockerising ArchiveBot?
11:27 πŸ”— JAA chr1sm: I think it's this: https://github.com/luckcolors/ArchiveBot/commit/135eb161fca18272d9687dc8adea4c63c5f123ef
11:27 πŸ”— chr1sm @JAA cheers!
11:28 πŸ”— JAA I don't remember exactly what issues we had. IPv4/IPv6 was one of them.
11:29 πŸ”— JAA In the end, we decided we need a separate ArchiveBot backend setup for testing stuff, which I've been meaning to set up for a while now but never got around to it.
11:44 πŸ”— ppsym has joined #archiveteam-bs
11:51 πŸ”— wp494 has quit IRC (Read error: Operation timed out)
11:51 πŸ”— wp494 has joined #archiveteam-bs
11:52 πŸ”— FluffyFox has joined #archiveteam-bs
11:53 πŸ”— tsr_ has joined #archiveteam-bs
11:53 πŸ”— Frogging has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— PurpleSym has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— K4k has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— VoynichCr has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— Tenebrae has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— MrRadar2 has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— BnAboyZ has quit IRC (se.hub irc.efnet.nl)
11:53 πŸ”— tsr has quit IRC (se.hub irc.efnet.nl)
11:54 πŸ”— K4k_ has joined #archiveteam-bs
11:55 πŸ”— Tenebrae has joined #archiveteam-bs
11:59 πŸ”— MrRadar2 has joined #archiveteam-bs
11:59 πŸ”— svchfoo1 sets mode: +o MrRadar2
11:59 πŸ”— VoynichCr has joined #archiveteam-bs
12:08 πŸ”— FluffyFox is now known as Frogging
12:09 πŸ”— tsr_ is now known as tsr
12:09 πŸ”— ppsym is now known as PurpleSym
12:17 πŸ”— faolingfa What is the distinction between archivebot and warrior jobs? Being new here, they just seem like two different ways to organize archiving work. What requires one to be up 24/7 for months, whereas the other seems rather more flexible?
12:21 πŸ”— MrRadar2 has quit IRC (hub.efnet.us irc.efnet.nl)
12:21 πŸ”— VoynichCr has quit IRC (hub.efnet.us irc.efnet.nl)
12:21 πŸ”— Tenebrae has quit IRC (hub.efnet.us irc.efnet.nl)
12:22 πŸ”— Kaz archivebot covers small to medium sized sites, warrior jobs tend to be much bigger projects that archivebot wouldn't be able to handle / would run quicker with a team working on it
12:23 πŸ”— JAA faolingfa: ArchiveBot is a tool for recursively grabbing websites. As in, we throw in the homepage and it retrieves everything below it. A job on ArchiveBot is generally one website. Because some sites are *huge*, it can easily take months for jobs to complete. Due to how the system is designed, it's not possible to stop and restart the retrieval or to migrate it to another machine, so the pipelines
12:23 πŸ”— JAA must be up for months.
12:24 πŸ”— JAA And yeah, ArchiveBot and warrior are completely separate tools.
12:29 πŸ”— faolingfa I see. Thank your for the explanation!
12:32 πŸ”— Tenebrae has joined #archiveteam-bs
12:33 πŸ”— VoynichCr has joined #archiveteam-bs
12:33 πŸ”— MrRadar2 has joined #archiveteam-bs
12:33 πŸ”— svchfoo1 sets mode: +o MrRadar2
12:38 πŸ”— Aoede has joined #archiveteam-bs
12:41 πŸ”— fredgido has joined #archiveteam-bs
12:43 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
12:53 πŸ”— faolingfa So an AB pipeline 1) gets a job from the Redis server connected via SSH tunnel; 2) does a big wget job; 3) pushes result out using rsync. And the reason it needs high uptime is because step 2 takes forever and is not resumable. Did I get that right?
13:04 πŸ”— JAA Essentially yes. Minor details: ArchiveBot uses wpull, not wget. And the data is uploaded with rsync while the job is still running in chunks of 5 GiB (by default); most pipelines don't have enough storage to keep the full grab on disk until the end of the job.
13:04 πŸ”— JAA faolingfa: ^
13:19 πŸ”— MrRadar2 has quit IRC (Read error: Operation timed out)
14:06 πŸ”— MrRadar2 has joined #archiveteam-bs
14:06 πŸ”— svchfoo3 sets mode: +o MrRadar2
14:14 πŸ”— yuitimoth has quit IRC (Remote host closed the connection)
14:29 πŸ”— Dimtree has quit IRC (Read error: Operation timed out)
14:48 πŸ”— faolingfa I notice that wpull has "Graceful stopping; on-disk database resume" in its feature list; is it that wpull does not support resumes or that the ArchiveBot workflow that drives wpull does not?
14:49 πŸ”— JAA ArchiveBot's workflow doesn't. wpull absolutely does support that.
14:49 πŸ”— JAA (One of the reasons why it's awesome.)
14:57 πŸ”— faolingfa Is there some way to test drive a pipeline without hooking it up to the real backend? I am having a Jeremy Clarkson "how hard can it be" moment due to having been on vacation for 7 weeks and thus rather bored, so I am curious and would like to see the code actually working, in order to better think about making it resumable so that I could restart my systems for security updates once a month and still
14:57 πŸ”— faolingfa contribute.
15:02 πŸ”— JAA faolingfa: Yeah, that's exactly the reason why that Docker thing wasn't tested further. I need to set up a test/development backend, but I haven't had time for that.
15:02 πŸ”— JAA The code's all in the ArchiveTeam/ArchiveBot repository on GitHub though, and there are instructions for setting up the backend in INSTALL.backend or whatever it's called.
15:03 πŸ”— kiska I'll attempt something later, since I have no patience with a 48 hr initialisation of an raid array
15:06 πŸ”— faolingfa The backend is a Redis database + CouchDB database + ruby app for the actual API/ircbot + website, do I understand that right? I guess as long as I am in a "how hard can it be" mood I can try set that up for myself as well. I saw the pipeline code talk to Redis but what is the role of the CouchDB database?
15:07 πŸ”— JAA Something like that, yeah.
15:08 πŸ”— JAA I think the CouchDB stores the ignore sets and the user agents, but not entirely sure.
15:08 πŸ”— JAA I believe it used to store much more than that. The plan was to get rid of CouchDB entirely, but for whatever reason, that never happened.
15:10 πŸ”— JAA s/user agents/user agent aliases/
15:58 πŸ”— Dimtree has joined #archiveteam-bs
16:21 πŸ”— bitBaron has joined #archiveteam-bs
17:20 πŸ”— adinbied Anyone up for figuring out the best way to setup a warrior project for https://www.archiveteam.org/index.php?title=Angelfire ? The URL discovery script is about halfway done, and I've been trying to figure out the best way to grab stuff
17:28 πŸ”— adinbied Maybe each tracker 'item' could be all of a users URLs? And the pipeline would just grab and WARC as much as possible?
17:51 πŸ”— PurpleSym Wow, unbelievable. My Yahoo! Groups grab actually *finished*. I really thought it might take another year or so.
17:55 πŸ”— caff has joined #archiveteam-bs
18:15 πŸ”— JAA adinbied: How many URLs are we talking about here?
18:15 πŸ”— JAA PurpleSym: Oh, nice!
18:17 πŸ”— JAA adinbied: Ah, wiki says 3.9 million users. Hmm...
18:18 πŸ”— JAA I guess we could do users as items, yeah.
18:19 πŸ”— JAA Have it grab the sitemap, then all the pages, as well as index.html recursively.
18:21 πŸ”— Dimtree has quit IRC (Quit: Peace)
18:28 πŸ”— adinbied I've grabbed the sitemaps and am parsing all of the users individual sitemaps to get every URL - https://imgur.com/a/d8O3rt5 https://imgur.com/a/qz3816o
18:29 πŸ”— adinbied I posted a link to part of the URL grab here: https://archive.org/details/angelfireURLS_0x00-0x14 , which should give an idea of the data I'm working with
18:32 πŸ”— adinbied JAA, most users either only have 1 working URL or none at all - however some sites have hundreds of active URLs as seen when sorting the .urls files by size
18:34 πŸ”— jut So 1000 users per item?
18:37 πŸ”— adinbied That sounds like it should work - although someone with more knowledge than me can chime in if it isn't feasible. I also don't know how each item would be set up - the way I was thinking about it was having all of the user's sitemaps parsed and then the tracker gives out URLs, but I'm wondering if it might be better for the tracker to give out a username and then the warrior client grabs and parses the sitemap, then
18:37 πŸ”— adinbied proceeds with the grab
18:38 πŸ”— Dimtree has joined #archiveteam-bs
18:42 πŸ”— JAA Not sure if 1000 users per item works. The item names would get very long. I don't know if there are any limits on that.
18:42 πŸ”— JAA But yeah, we could group multiple users into an item.
18:43 πŸ”— JAA Yes, the tracker should hand out users. An example item using the URLs on the wiki might be "users:punk4/jori_loves_jackass:vevayaqo:planet/dumbass123:ab7/pledgecry".
18:45 πŸ”— JAA Then the client grabs http://www.angelfire.com/${user}/sitemap.xml, all URLs inside it, and maybe http://www.angelfire.com/${user}/ or .../index.html or something like that.
18:45 πŸ”— adinbied Alright, let me modify my discovery script to just grab users instead of downloading and parsing all of their sitemaps - that should be much faster than what it's currently doing
18:45 πŸ”— JAA Yeah, that should just be a matter of parsing sitemap-index-00.xml.gz through -ff.xml.gz.
18:47 πŸ”— Ceryn has joined #archiveteam-bs
18:53 πŸ”— adinbied OK, that just finished - uploading to IA now
18:59 πŸ”— Ceryn has quit IRC (Quit: WeeChat 1.4)
18:59 πŸ”— adinbied Here's all of the users: https://archive.org/details/angelfire-users-all_201808
19:01 πŸ”— adinbied There are 256 files, each with 15,000-16,000 users each
19:01 πŸ”— adinbied How many users should each item be? 2? 3? more?
19:55 πŸ”— faolingfa adinbied: Where can I see the source code of the Angelfire project? Are you willing to answer lots of potentially dumb questions about it? I am very curious to learn how all the automation works so I can one day hopefully find some time-efficient ways to contribute.
19:56 πŸ”— faolingfa Is it this https://github.com/adinbied/angelfire-items or is there more?
19:58 πŸ”— adinbied faolingfa, sure - as much as I can. I'm also relatively new as far as the technical stuff goes, but I'll do my best to answer what I can. There is currently no written code for the angelfire grab that will actually be used - the github link you posted was my attempt at scraping all of the sitemaps for every possible URL - although now that it's going to be a Warrior based project, that code won't be used.
19:59 πŸ”— adinbied The Dev Wiki page has some good info about how the Warrior works and some of the automation stuff, definitely check that out: https://www.archiveteam.org/index.php?title=Dev
20:06 πŸ”— faolingfa So if I understand it right, the idea is to make work items based on the Angelfire sitemaps. The tracker then distributes these items to warriors. Right? How/where exactly are these work items defined? Is it a one time process? Can more be added to a project over time based on updates to the target? Is there a standard way to make them? What is the data structure that defines an item?
20:11 πŸ”— adinbied Yup, that's correct. Items can be added to the tracker at any point, and the format depends on the pipeline.py and how it gets set up. Items are typically in the format: "itemtype:itemstart-itemstop" although that can change. For more in-depth info, ask in the #warrior channel - someone there might be able to give more info
20:14 πŸ”— chferfa has joined #archiveteam-bs
20:15 πŸ”— schbirid has quit IRC (Remote host closed the connection)
20:16 πŸ”— Jens has quit IRC (Remote host closed the connection)
20:17 πŸ”— Jens has joined #archiveteam-bs
20:17 πŸ”— faolingfa Does that mean an "item" is just a freeform string given to the pipeline.py script? The tracker hands out these strings for the script to parse and process?
20:19 πŸ”— adinbied Exactly - for example, the Quizlet project had items that were strings of "api:100000000-100000999", "api:100001000-100001999" etc.
20:21 πŸ”— adinbied There's a VirtualBox image that has a tracker ready to go so you can mess with stuff - to use it in conjunction with the pipeline.py scripts, simply change the tracker URL to localhost:9080 or whatever port is being forwarded by Virtualbox - then the two should be able to communicate
20:21 πŸ”— adinbied https://github.com/ArchiveTeam/archiveteam-dev-env
20:23 πŸ”— faolingfa And for Angelfire the script would take 1 or more users per item, run wget (or is it wpull?) on each, possibly with some custom Lue scripting for handling special cases, then pack the result up into a warc and upload it? Would it be one warc per item?
20:26 πŸ”— adinbied It would be wget with lua hooks - and yes, once it's done it would be uploaded. I'm pretty sure it would be one WARC per item, although in this case given each item has multiple users, I have no idea how that will work - arkiver / JAA usually are the ones helping out with the Warrior stuff - I'm still learning :)
20:28 πŸ”— JAA Yeah, there's normally one WARC per item.
20:30 πŸ”— JAA The WARCs are then combined into so-called mega-WARCs on the target, which get uploaded to IA.
20:33 πŸ”— adinbied Created a channel for the Angelfire project: #angelonfire
20:34 πŸ”— faolingfa Sounds pretty straightforward. What GitHub repository should I look at to see an example of some pipeline.py and wget lua scripts that would be similar in principle than what one would expect to be used on the Angelfire project?
20:35 πŸ”— adinbied No idea about ones that would be similar, but just browsing https://github.com/ArchiveTeam repo's that end in -grab should give a good idea about what real-world pipelines contain
20:44 πŸ”— faolingfa Okay. This lua support in wget is a feature of a custom fork of wget, yes? What is the lua scripting commonly used for? Filtering? Anything else?
20:50 πŸ”— faolingfa Aha, I found some more examples on the wiki. So you can also generate new URLs from it and do a bit more logic. Neat.
20:52 πŸ”— JAA Yeah, wget-lua is a fork of wget. The Lua scripts are used for filtering URLs, adding extra URLs, and handling error codes (e.g. ignore 404s but error out on 429).
20:53 πŸ”— Raccoon Is it being maintained by the same wget team? #wget on freenode
20:55 πŸ”— faolingfa Is it being maintained?
20:55 πŸ”— ivan wget-lua is a fork by alard, doubt anyone over at wget cares about it
20:56 πŸ”— JAA Yep, and it's effectively not maintained anymore. The last commit was in 2015.
21:00 πŸ”— Muad-Dib has quit IRC (Quit: ZNC - http://znc.in)
21:43 πŸ”— Flashfire Can we get a wiki updated in the apiary?
21:43 πŸ”— Flashfire https://wikiapiary.com/wiki/Chewiki Is shutting down soon
21:43 πŸ”— Flashfire Last complete backup was years ago
21:45 πŸ”— JAA Flashfire: #wikiteam is the right channel for that.
22:04 πŸ”— Smiley has quit IRC (Read error: Connection reset by peer)
22:04 πŸ”— Smiley has joined #archiveteam-bs
22:06 πŸ”— bitBaron has quit IRC (My computer has gone to sleep. 😴πŸ˜ͺZZZzzz…)
22:08 πŸ”— chirlu has quit IRC (Read error: Operation timed out)
22:16 πŸ”— chirlu has joined #archiveteam-bs
22:39 πŸ”— bitBaron has joined #archiveteam-bs
23:24 πŸ”— Panasonic has joined #archiveteam-bs
23:42 πŸ”— antomatic has joined #archiveteam-bs
23:42 πŸ”— swebb sets mode: +o antomatic
23:45 πŸ”— antomati_ has quit IRC (Ping timeout: 260 seconds)
