update README and create a Maintenance.md file

This commit is contained in:
Jonathan Math 2020-11-06 14:40:58 +07:00
parent cbd08902c5
commit c452728f82
2 changed files with 128 additions and 64 deletions

Maintenance.md Normal file

@@ -0,0 +1,128 @@
### Bootstrapping Search **** is this still applicable?
Once you have an elasticsearch server running, you'll want to bootstrap it with feed and story indexes:
```
./manage.py index_feeds
```
Stories will be indexed automatically.
If you need to move search servers and want to delete everything in the search database, you need to reset the MUserSearch table. Run `make shell`, then:
```
>>> from apps.search.models import MUserSearch
>>> MUserSearch.remove_all()
```
### If feeds aren't fetching:
Check that the `tasked_feeds` queue is empty. You can drain it by running `make shell`, then:
```
Feed.drain_task_feeds()
```
This happens when a deploy on the task servers hits faults and the task servers lose their
connection without giving the tasked feeds back to the queue. Feeds that fall through this
crack are automatically fixed after 24 hours, but if many feeds fall through due to a bad
deploy or electrical failure, you'll want to accelerate that check by just draining the
tasked feeds pool, adding those feeds back into the queue. This command is idempotent.
## In Case of Downtime
You got the downtime message either through email or SMS. This is the order of operations for determining what's wrong.
0a. If downtime goes over 5 minutes, go to Twitter and say you're handling it. Be transparent about what it is;
NewsBlur's followers are largely technical. Also, the 502 page points users to Twitter for status updates.
0b. Ensure you have `secrets-newsblur/configs/hosts` installed in your `/etc/hosts` so server hostnames
work.
1. Check www.newsblur.com to confirm it's down.
If you don't get a 502 page, then NewsBlur isn't even reachable and you just need to contact [the
hosting provider](https://cloudsupport.digitalocean.com/s/createticket) and yell at them.
2. Check which servers can't be reached on the HAProxy stats page. The basic auth credentials are in `secrets/configs/haproxy.conf`; search the secrets repo for "gimmiestats".
Typically it'll be mongo, but any of the redis or postgres servers can be unreachable due to
acts of god. Otherwise, a frequent cause is lack of disk space. There are monitors on every DB
server watching for disk space, emailing me when they're running low, but it still happens.
3. Check [Sentry](https://app.getsentry.com/newsblur/app/) and see if the answer is at the top of the
list.
This will show if a database (redis, mongo, postgres) can't be found.
4. Check the various databases:
a. If Redis server (db_redis, db_redis_story, db_redis_pubsub) can't connect, redis is probably down.
SSH into the offending server (or just check both the `db_redis` and `db_redis_story` servers) and
check if `redis` is running. You can often `tail -f -n 100 /var/log/redis.log` to find out if
background saving was being SIG(TERM|INT)'ed. When redis goes down, it's always because it's
consuming too much memory. That shouldn't happen, so check the [munin
graphs](http://db_redis/munin/).
Boot it with `sudo /etc/init.d/redis start`.
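The `tail -n 100` check above is a manual scan for interrupted background saves. As a minimal sketch of the same scan in Python (the log phrasing matched here is an assumption based on common redis log wording and varies by version, so treat the pattern as a starting point):

```python
import re

# Assumed phrasing: redis logs lines like "Background saving terminated by
# signal" and "Received SIGTERM scheduling shutdown..."; adjust per version.
KILLED_SAVE = re.compile(
    r"SIGTERM|SIGINT|Background saving (terminated|error)", re.IGNORECASE
)

def find_killed_saves(log_lines):
    """Return the log lines that suggest a background save was killed."""
    return [line for line in log_lines if KILLED_SAVE.search(line)]
```

Usage mirroring the manual check: `find_killed_saves(open("/var/log/redis.log").readlines()[-100:])`.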
b. If mongo (db_mongo) can't connect, mongo is probably down.
This is rare and usually signifies hardware failure. SSH into `db_mongo` and check the logs with `tail
-f -n 100 /var/log/mongodb/mongodb.log`. Try starting mongo with `sudo /etc/init.d/mongodb start`. If
the machine itself is failing, promote one of the secondaries (the next largest mongodb server) to
primary, kill the offending primary machine, and rebuild it (preferably at a larger size). I
recommend waiting a day to rebuild it so that you get a different machine. Don't forget to lodge a
support ticket with the hosting provider so they know to check the machine.
If it's the db_mongo_analytics machine, there are no backups and no secondaries of the data (because
it's ephemeral and used for, you guessed it, analytics). You can easily provision a new mongodb
server and point to that machine.
If mongo is out of space, which happens, the servers need to be re-synced every 2-3 months to
compress the data bloat. Simply `rm -fr /var/lib/mongodb/*` and re-start Mongo. It will re-sync.
If both secondaries are down, then the primary Mongo will go down. You'll need a secondary mongo
in the sync state at the very least before the primary will accept reads. It shouldn't take long to
get into that state, but you'll need a mongodb machine setup. You can immediately reuse the
non-working secondary if disk space is the only issue.
c. If postgresql (db_pgsql) can't connect, postgres is probably down.
This is the rarest of the rare and has in fact never happened. Machine failure. If you can salvage
the db data, move it to another machine. Worst case you have nightly backups in S3. The fabfile.py
has commands to assist in restoring from backup (the backup file just needs to be local).
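Each of the database checks above starts with the same question: can the server be reached at all? Before SSHing in, a plain TCP probe answers that quickly. A sketch (the hostnames assume the `secrets-newsblur` hosts file from step 0b is installed; the ports are the stock defaults, not confirmed for this deployment):

```python
import socket

def port_open(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed defaults: redis 6379, mongo 27017, postgres 5432.
DB_PORTS = [("db_redis", 6379), ("db_mongo", 27017), ("db_pgsql", 5432)]

def unreachable(checks=DB_PORTS):
    """Return the host:port pairs that did not accept a connection."""
    return [f"{host}:{port}" for host, port in checks if not port_open(host, port)]
```

Anything listed by `unreachable()` is where to start SSHing.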
5. Point to a new/different machine
a. Confirm the IP address of the new machine with `fab list_do`.
b. Change `secrets-newsblur/configs/hosts` to reflect the new machine.
c. Copy the new `hosts` file to all machines with:
```
fab all setup_hosts
```
d. Changes should be instant, but you can also bounce every machine with:
```
fab web deploy
fab task celery
```
e. Monitor `utils/tlnb.py` and `utils/tlnbt.py` for lots of reading and feed fetching.
6. If feeds aren't fetching, check that the `tasked_feeds` queue is empty. You can drain it by running `make shell`, then:
```
Feed.drain_task_feeds()
```
This happens when a deploy on the task servers hits faults and the task servers lose their
connection without giving the tasked feeds back to the queue. Feeds that fall through this
crack are automatically fixed after 24 hours, but if many feeds fall through due to a bad
deploy or electrical failure, you'll want to accelerate that check by just draining the
tasked feeds pool, adding those feeds back into the queue. This command is idempotent.

README.md

@@ -98,70 +98,6 @@ reader, and feed importer. To run the test suite:
`make test`
## Keeping NewsBlur Running **** is this still applicable?
These commands keep NewsBlur fresh and updated. While on a development server, these
commands do not need to be run more than once. However, you will probably want to run
the `refresh_feeds` command regularly so you have new stories to test with and read.
### Fetching feeds **** is this still applicable?
If you just want to fetch feeds once, you can use the `refresh_feeds` management command:
```
./manage.py refresh_feeds
```
If you want to fetch feeds regardless of when they were last updated:
```
./manage.py refresh_feeds --force
```
You can also fetch the feeds for a specific user:
```
./manage.py refresh_feeds --user=newsblur
```
You'll want to put this `refresh_feeds` command on a timer to keep your feeds up to date.
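The three invocations above differ only in their flags. As a purely illustrative helper (not part of NewsBlur's codebase), building the argv in one place keeps a timer script honest about which variant it runs:

```python
def refresh_feeds_argv(force=False, user=None):
    """Build the argv for ./manage.py refresh_feeds.
    The flag names mirror the examples above; the helper itself is a sketch."""
    argv = ["./manage.py", "refresh_feeds"]
    if force:
        argv.append("--force")
    if user:
        argv.append(f"--user={user}")
    return argv
```

Pass the result to `subprocess.run` from whatever timer mechanism (cron, systemd) you use.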
### Feedback **** is this still applicable?
To populate the feedback table on the homepage, use the `collect_feedback` management
command every few minutes:
```
./manage.py collect_feedback
```
### Statistics **** is this still applicable?
To populate the statistics graphs on the homepage, use the `collect_stats` management
command every few minutes:
```
./manage.py collect_stats
```
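`collect_feedback` and `collect_stats` are both "every few minutes" jobs, and cron is the obvious home for them. As a development-box sketch (the job list mirrors the commands documented above; the runner itself is an assumption, not NewsBlur code):

```python
import subprocess

def run_once(commands):
    """Run each management command; return how many exited with status 0."""
    ok = 0
    for argv in commands:
        # check=False: keep going even if one command fails.
        if subprocess.run(argv, check=False).returncode == 0:
            ok += 1
    return ok

# Assumed periodic job list, taken from the commands above.
PERIODIC = [
    ["./manage.py", "collect_feedback"],
    ["./manage.py", "collect_stats"],
]
```

Call `run_once(PERIODIC)` from cron or a simple sleep loop.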
## Author
* Created by [Samuel Clay](http://www.samuelclay.com).